ABSTRACT | Spoken language recognition refers to the automatic process through which we determine or verify the identity of the language spoken in a speech sample. We study a computational framework that allows such a decision to be made in a quantitative manner. In recent decades, we have made tremendous progress in spoken language recognition, which benefited from technological breakthroughs in related areas, such as signal processing, pattern recognition, cognitive science, and machine learning. In this paper, we attempt to provide an introductory tutorial on the fundamentals of the theory and the state-of-the-art solutions, from both phonological and computational aspects. We also give a comprehensive review of current trends and future research directions using the language recognition evaluation (LRE) formulated by the National Institute of Standards and Technology (NIST) as the case studies.

KEYWORDS | Acoustic features; calibration; classifier; fusion; language recognition evaluation (LRE); phonotactic features; spoken language recognition; tokenization; vector space modeling

Manuscript received May 22, 2012; revised September 19, 2012; accepted December 10, 2012. Date of publication February 6, 2013; date of current version April 17, 2013.
H. Li is with the Institute for Infocomm Research, Agency for Science, Technology, and Research (A*STAR), Singapore 138632, and also with the University of New South Wales, Kensington, NSW 2052, Australia (e-mail: hli@i2r.a-star.edu.sg).
B. Ma and K. A. Lee are with the Institute for Infocomm Research, Agency for Science, Technology, and Research (A*STAR), Singapore 138632 (e-mail: mabin@i2r.a-star.edu.sg; kalee@i2r.a-star.edu.sg).
Digital Object Identifier: 10.1109/JPROC.2012.2237151

1136 Proceedings of the IEEE | Vol. 101, No. 5, May 2013 | 0018-9219/$31.00 © 2013 IEEE
Li et al.: Spoken Language Recognition: From Fundamentals to Practice

I. INTRODUCTION

Spoken language recognition refers to the automatic process that determines the identity of the language spoken in a speech sample. It is an enabling technology for a wide range of multilingual speech processing applications, such as spoken language translation [141], multilingual speech recognition [116], and spoken document retrieval [23]. It is also a topic of great interest in the areas of intelligence and security for information distillation.

Humans are born with the ability to discriminate between spoken languages as part of human intelligence [147]. The quest to automate such an ability has never stopped [69], [70], [89], [96], [150]. Just like any other artificial intelligence technology, spoken language recognition aims to replicate this human ability through computational means. The invention of digital computers has made this possible. A key question is how to scientifically measure the individuality of the diverse spoken languages in the world. Today, automatic spoken language recognition is no longer a part of science fiction. We have seen it being deployed for practical uses [23], [81], [116], [141].

It is estimated that there exist several thousand spoken languages in the world [29], [30], [58]. The recent edition of the Ethnologue, a database describing all known living languages [71], has documented 6909 living spoken languages. Text-based language recognition has traditionally relied on distinct textual features of languages such as words or letter substrings. It was arguably established in 1967 by Gold [41] as a closed-class experiment, in which human subjects were asked to classify a given test document into one of the languages. It was not until the 1990s, however, that people resorted to statistical techniques that formulate the problem as a text categorization problem [3], [22], [37]. For languages that use the Latin alphabet, text-based language recognition has attained reasonably good performance, thus it is considered a solved problem [83]. As words and letter substrings are the manifestation of lexical–phonological rules of a language, this research has led to the conjecture that a spoken language can be characterized by its lexical–phonological constraints.

In practice, spoken language recognition is far more challenging than text-based language recognition because there is no guarantee that a machine is able to transcribe speech to text without errors. We know that humans recognize languages through a perceptual or psychoacoustic process that is inherent in the auditory system. Therefore, the type of perceptual cues that human listeners use is always the source of inspiration for automatic spoken language recognition [147].

Human listening experiments suggest that there are two broad classes of language cues: the prelexical information and the lexical semantic knowledge [147]. Phonetic repertoire, phonotactics, rhythm, and intonation are all parts of the prelexical information [108]. Although one may consider phones and phonotactics as segmental, and rhythm and intonation as suprasegmental, they are all very different from the lexical semantic knowledge that the words encapsulate, such as semantic meanings and conceptual representations. There is no doubt that both prelexical and lexical semantic knowledge contribute to the human perceptual process for spoken language recognition [108].

Studies in infants' listening experiments have revealed that when infants have not gained a great deal of lexical knowledge, they successfully rely on prelexical cues to discriminate between languages [108]. In fact, when an adult is dealing with two unfamiliar languages, one can only use prelexical information. But as the infant's language experience enriches, or as the adult is handling familiar languages, lexical semantic information will start to play a very important, sometimes determining, role [147]. While we know that it requires a major effort to command the lexical usage of an entirely new language, in human listening studies, we have observed that human subjects are able to acquire prelexical information rapidly for language recognition purposes.

The relative importance of perceptual cues for language recognition has always been a subject of debate. Studies in automatic spoken language recognition have confirmed that acoustic and phonotactic features are the most effective language cues [97], [150]. This coincides with the findings from human listening experiments [96], [137]. While the term "acoustic" refers to physical sound patterns, the term "phonotactic" refers to the constraints that determine permissible syllable structures in a language. We can consider acoustic features as the proxy of the phonetic repertoire and call them acoustic–phonetic features. On the other hand, we see phonotactic features as the manifestation of the phonotactic constraints in a language. Therefore, in this paper, we would like to introduce the spoken language recognition techniques that are mainly based on acoustic and phonotactic features, which represent the mainstream research [87], [102], [150]. In what follows, we use the term language recognition for brevity.

We will provide insights into the fundamental principles of the research problem, and a practical applicability analysis of different state-of-the-art solutions to language recognition. We will also make an attempt to establish the connection across different techniques. Advances in feature extraction, acoustic modeling, phone n-gram modeling, phone recognizers, phone lattice generation, vector space modeling, intersession compensation, and score calibration and fusion have contributed to the state-of-the-art performance [13], [73], [121]. To benefit from a wealth of publications related to the National Institute of Standards and Technology (NIST) language recognition evaluations (LREs) [85], [86], [88], [89], [102], we will demonstrate the working principles of various techniques using the NIST LREs as case studies.

The rest of the paper is organized as follows. We will elaborate on the principles, problems, and recent advances in language modeling in Section II. We will introduce the phonotactic approaches and acoustic approaches in Sections III and IV, respectively. Advanced topics in system developments related to Gaussian back-ends, multiclass logistic regression, score calibration and fusion, and language detection will be discussed in Section V. Section VI will be devoted to the NIST LRE paradigm, and, finally, an overview of current trends and future research directions will be discussed in Section VII.

II. LANGUAGE RECOGNITION PRINCIPLES

A. Principles of Language Characterization
The first perceptual experiment measuring how well human listeners can perform language identification was reported in [96]. It was concluded that human beings, with adequate training, are the most accurate language recognizers. This observation still holds after 15 years, as confirmed again in [137], provided that the human listeners speak the languages.

For languages that they are not familiar with, human listeners can often make subjective judgments with reference to the languages they know, e.g., it sounds like Arabic, it is tonal like Mandarin or Vietnamese, or it has a stress pattern like German or English. Though such judgments are less precise for the hard decisions to be made in an identification task, they show how human listeners apply linguistic knowledge at different levels for distinguishing between certain broad language groups, such as tonal versus nontonal languages.

Studies also revealed that, given only little previous exposure, human listeners can effectively identify a language without much lexical knowledge. In this case, human listeners rely on prominent phonetic, phonotactic, and prosody cues to characterize the languages [96], [137].

The set of language cues discussed above can be illustrated according to their level of knowledge abstraction,
overlapping sets of phonemes. The same basis was used in constructing the international phonetic alphabet (IPA). Though there are 6909 languages in the world [71], the total number of phones required to represent all the sounds of these languages ranges only from 200 to 300 [2]. Phones that are common across languages are grouped together and given the same symbol in IPA. For instance, the GlobalPhone multilingual text and speech database [117] uses a phone set consisting of 122 consonants and 114 vowels. Three nonphonetic units (two noise models and one silence model) are also defined for modeling nonspeech events. The same phone set is used for transcribing 20 spoken languages in the GlobalPhone database.

A systematic study on the extent to which languages are separated in phonetic and phonotactic terms is possible using the GlobalPhone database, in which languages are mapped to a common set of IPA symbols. Fig. 2 compares and contrasts the phonetic histograms of two languages that are arbitrarily selected from the GlobalPhone database in the form of polar plots. The size of the pie shape in the figure indicates the normalized count of a phone in a speech utterance. The three plots in the upper panel were obtained for three different subjects speaking Czech of different contents, while the plots in the lower panel are for another three subjects speaking Portuguese of different contents. A visual inspection easily confirms that the two languages are different in terms of phonetic repertoire and the frequency count of phones, while only slight variations are observed for different instances within the same language despite the different speech contents. If we treat the phonetic histogram as an empirical distribution of the language, we may apply similarity measures to provide quantifiable measurements between languages.

We can also use the GlobalPhone database to study the phonotactic differences between languages by examining how well a phone n-gram model of one language predicts the phone sequence across different languages in terms of perplexity [55]. A lower perplexity shows that a phone n-gram matches the phone sequence better; in other words, the phone sequence is more predictable. We expect a low perplexity when the n-gram model and the test phone sequence are of the same language, while a high perplexity is expected otherwise. To this end, we first build a phone n-gram model [55] for each of the seven languages selected from the GlobalPhone database that have been transcribed in IPA. We then evaluate the perplexity of the phone n-gram models over some held-out test data for every language pair. Table 1 shows the perplexity measures between
Fig. 2. Polar histograms showing the phone distributions of (a)–(c) Czech and (d)–(f) Portuguese utterances for three different
native speakers in each language. Utterances in the same language are of different contents.
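To make the idea of the phonetic histogram as an empirical distribution concrete, here is a minimal sketch in Python. The mini phone inventory and transcriptions are hypothetical stand-ins for the GlobalPhone data, and cosine similarity is just one of several similarity measures one might apply:

```python
from collections import Counter
import math

def phone_histogram(phones, inventory):
    """Normalized phone-count histogram: an empirical distribution
    over a fixed, shared phone inventory."""
    counts = Counter(phones)
    total = len(phones)
    return [counts[p] / total for p in inventory]

def cosine_similarity(h1, h2):
    """Cosine similarity between two histograms (1.0 = identical shape)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    n1 = math.sqrt(sum(a * a for a in h1))
    n2 = math.sqrt(sum(b * b for b in h2))
    return dot / (n1 * n2)

# Hypothetical IPA-like transcriptions of two utterances
inventory = ["a", "e", "i", "s", "t", "r"]
utt_lang1 = ["t", "a", "s", "t", "e", "a", "r", "a"]
utt_lang2 = ["i", "s", "i", "t", "i", "r", "e", "i"]

h1 = phone_histogram(utt_lang1, inventory)
h2 = phone_histogram(utt_lang2, inventory)
print(round(cosine_similarity(h1, h2), 3))
```

Utterances of the same language would be expected to yield histograms with a noticeably higher similarity than this cross-language pair.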
Table 1 Perplexity Measured Between Seven Languages Selected Arbitrarily From the GlobalPhone Multilingual Database Based on Bigram and
Trigram Models
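The cross-language perplexity measurement summarized in Table 1 can be sketched as follows. The phone sequences here are toy stand-ins for the GlobalPhone transcriptions; the bigram model uses the add-one smoothing mentioned in the text:

```python
import math
from collections import Counter

def train_bigram(phone_seq, phone_set):
    """Phone bigram model with add-one smoothing over a shared phone set."""
    bigrams = Counter(zip(phone_seq, phone_seq[1:]))
    unigrams = Counter(phone_seq[:-1])
    V = len(phone_set)
    def prob(prev, cur):
        return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
    return prob

def perplexity(prob, test_seq):
    """Perplexity of a bigram model over a held-out phone sequence."""
    log_sum = sum(math.log(prob(p, c)) for p, c in zip(test_seq, test_seq[1:]))
    n = len(test_seq) - 1
    return math.exp(-log_sum / n)

# Hypothetical training and held-out phone sequences for two languages
phone_set = ["a", "e", "s", "t", "r"]
train_cz = ["s", "t", "r", "a", "s", "t", "r", "e", "s", "t"]
train_pt = ["a", "s", "a", "t", "e", "a", "s", "e", "a", "s"]
test_cz  = ["s", "t", "r", "e", "s", "t", "r", "a"]

pp_match    = perplexity(train_bigram(train_cz, phone_set), test_cz)
pp_mismatch = perplexity(train_bigram(train_pt, phone_set), test_cz)
print(pp_match < pp_mismatch)  # the same-language model is less perplexed
```

Repeating this for every language pair yields a perplexity matrix of the kind shown in Table 1, where the diagonal (matched) entries are the smallest.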
languages for the cases of bigram and trigram in the upper and lower panels, respectively. The tabulated data clearly indicate that the lowest perplexity values are always observed when the phone n-gram models and the test data are from the same language. This observation confirms the differences between languages in terms of phonotactics and the effectiveness of using phone n-gram models to quantitatively measure such differences.

While the analysis of the GlobalPhone database was conducted using perfectly transcribed phone sequences, as opposed to an automatic transcription, it illustrates a quantifiable link between phonetic and phonotactic features and language recognition. For the purpose of the analysis described above, phonetic transcriptions were derived from orthographic transcriptions by means of pronunciation dictionaries, as defined in the GlobalPhone database. Furthermore, we use the add-one smoothing method [55] when training n-gram models to deal with the zero-count problem caused by unseen or out-of-set phones for a particular language (recall that the same phone set is used across the different languages in the GlobalPhone database).

languages involves an assignment of the most likely language label L̂ to the acoustic observation O, such that

    L̂ = argmax_l p(O | L_l)    (1)

which follows the maximum-likelihood (ML) criterion. For the case where the prior probabilities of observing individual languages are not uniform, the maximum a posteriori (MAP) criterion could be used [64]. For mathematical tractability, the language-specific density p(o | L_l) is always assumed to follow some functional form, for which parametric models could be used. In the simplest case, a Gaussian mixture model (GMM) [149], [150] can be used to model directly the distribution of the acoustic features, as we will see in Section IV.

To deal with phones and phonotactic knowledge sources, we assume that the speech waveform can be segmented into a sequence of phones Ŷ. Applying the ML criterion, we have now the most likely language given by
Fig. 3. General scheme of acoustic–phonetic and phonotactic approaches to automatic language recognition, where N indicates the
number of target languages.
In (2), we have assumed that we know the exact sequence of phones in the speech waveform O. In practice, Ŷ has to be decoded from O by selecting the most likely one from all possible sequences. To this end, we hinge on a set of phone models M, such as a hidden Markov model (HMM), to decode the waveform. The most likely phone sequence could then be obtained using Viterbi decoding

    Ŷ = argmax_Y P(O | Y, M).    (3)

Putting (2) and (3) together, and considering all possible phone sequences instead of the single best hypothesis, we obtain a more general form

    L̂ = argmax_l Σ_{∀Y} P(O | Y, M) P(Y | L_l).    (4)

More details can be folded in by having M be language dependent or by using lexical constraints in the phone decoding [40], [118]. State-of-the-art language recognizers always resort to the simpler cases as in (1)–(3). In Sections III and IV, we will give a detailed mathematical formulation as to how the modeling and the decision rule are implemented in practice.

C. Recent Advances
Contemporary language recognition systems operate in two phases: training and the runtime test. During the training phase, speech utterances are analyzed and models are built based on the training data given the language labels. The models are intended to represent some language-dependent characteristics seen in the training data. Depending on the information sources they model, these models could be 1) stochastic, e.g., the Gaussian mixture model (GMM) [149], [150] and hidden Markov model (HMM) [98], [149]; 2) deterministic, e.g., vector quantization (VQ) [125], the support vector machine (SVM) [16], [19], [31], [65], and the neural network [7]; or 3) discrete stochastic, e.g., the n-gram [40], [46], [66], [67]. During the test phase, a test utterance is compared to each of the language-dependent models after going through the same preprocessing and feature extraction steps. The likelihood of the test utterance given each model is computed. The language associated with the most likely model is hypothesized as the language of the utterance, in accordance with the ML criterion as given in (1), (2), and (4) in Section II-B.

A wide spectrum of approaches has been proposed for modeling the characteristics of languages. In this paper, we are particularly interested in the two most effective ones: the acoustic–phonetic approach and the phonotactic approach. Fig. 3 shows exemplars from these two broad categories in one diagram illustrating their common ground and differences. The advanced techniques have explored the combination of different features and approaches.

The acoustic–phonetic approach is motivated by the observation that languages differ at a very fundamental level in terms of phones and the frequencies with which these phones occur (i.e., the phonetic differences between languages). More importantly, it is assumed that the phonetic characteristics could be captured by some form of spectral-based features. We then proceed to model the distribution of each language in the feature space. Notably, the shifted-delta cepstrum (SDC) used in conjunction with the GMM has been shown to be very successful in this regard [131], where a GMM is used to approximate the acoustic–phonetic distribution of a language. It is generally believed that each
Gaussian density in a GMM captures some broad phonetic classes [110]. However, the GMM is not intended to model the contextual or dynamic information of speech.

The phonotactic approach is motivated by the belief that a spoken language can be characterized by its lexical–phonological constraints. With phone transcriptions, we build N phone n-gram models for an N-language task. Briefly, a phone n-gram model [55] is a stochastic model with discrete entries, each describing the probability of a subsequence of n phones (more details in Section III). Given a test utterance, each phone n-gram model produces a likelihood score. The language of the most likely hypothesis represents the classification decision. The method is referred to as phone recognition followed by language modeling (PRLM) in the literature [150]. The languages used for the phone recognizers need not be the same as, and may even be disjoint from, any of those to be recognized. The idea is analogous to the case where a human listener uses his native language knowledge to describe unknown languages.

From a system development point of view, it is worth noting that the acoustic–phonetic approach requires only the digitized speech utterances and their language labels, while the phonotactic approach requires the phonetic transcription of speech, which could be expensive to obtain. The two broad categories as described above are by no means encompassing all possible algorithms and approaches available. Variants exist within and between these broad categories, for example, GMM tokenization [130], parallel phone recognition [150], universal phone recognition (UPR) [73], the articulatory attribute-based approach [122], [123], and discriminatively trained GMMs using maximum mutual information (MMI) [14], [142] or minimum classification error (MCE) [54], [105], just to name a few.

State-of-the-art language recognition systems consist of multiple subsystems in parallel, where the individual scores are combined via a postprocessing back-end, as shown in Fig. 3. The motivation of score fusion is to harness the complementary behavior among subsystems, provided that their errors are not completely correlated. Also appearing at the score postprocessing back-end is a calibration stage, the purpose of which is to ensure that the individual scores are consistently meaningful across utterances. In particular, score calibration has been shown to be essential for the language detection task, where decisions have to be made using a fixed threshold for all target languages given test segments with arbitrary in-set and out-of-set languages [85]–[87]. On the other hand, well-calibrated scores also make it easier to map unconstrained scores to a normalized range, which can be viewed as confidence scores for downstream consumers. Recent works reported in [8], [9], and [12] have shown a unified framework where score calibration and fusion are performed jointly using multiclass logistic regression. We elaborate further on score calibration and fusion in Section V.

III. PHONOTACTIC APPROACHES
The first sustained effort using phonotactic patterns was to distinguish languages by comparing the frequency of occurrences of certain reference sounds or sound sequences with that of the target languages [69], [70]. To do this, we need to first tokenize the running speech into sound units. This can be achieved by a conventional speech recognizer that employs phonetic classes, phone classes, or phones [4], [5], [44], [45], [62], [95], [133], [143], [149] as the sound units. The first attempt to use phonotactic constraints for language recognition was conducted using a Markov process to model sequences of broad phonetic classes generated by a manual phonetic transcription in eight languages [51]. The Markov process with the broad phone classes was later applied to real speech data for the language identification of five languages [77]. Next, we discuss the tokenization techniques and two common phonotactic modeling frameworks.

A. Speech Tokenization
A speech tokenizer converts an utterance into a sequence of sound tokens. As illustrated in Fig. 4, a token is defined to describe a distinct acoustic–phonetic attribute and can be of different sizes, ranging from a speech frame, to a subword unit such as a phone or a syllable, to a lexical word.

Fig. 4. Tokenization of speech at different levels with tokens of different sizes, ranging from a speech frame, a phone, to a lexical word.

The GMM tokenization technique [130] operates at the frame level. It converts a sequence of speech frames into a sequence of Gaussian labels, each of which gives rise to the highest likelihood of the speech frame. This is similar to a vector quantization process. As GMM tokenization does not attempt to relate a Gaussian label with an acoustic–phonetic event, it does not require transcribed speech data for model training. Unfortunately, the interframe
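The frame-level GMM tokenization just described can be sketched as follows. The 2-D features and hand-picked three-component GMM are hypothetical; a real tokenizer would label spectral feature vectors with a GMM trained on speech:

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_tokenize(frames, means, variances):
    """Label each frame with the index of its highest-likelihood Gaussian."""
    tokens = []
    for frame in frames:
        scores = [log_gaussian(frame, m, v) for m, v in zip(means, variances)]
        tokens.append(scores.index(max(scores)))
    return tokens

# Hypothetical 2-D features and a 3-component diagonal-covariance GMM
means     = [[0.0, 0.0], [3.0, 3.0], [-3.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [2.8, 3.1], [-2.9, 0.8], [0.3, 0.1]]
print(gmm_tokenize(frames, means, variances))  # -> [0, 1, 2, 0]
```

The resulting label sequence plays the same role as a phone sequence in the phonotactic approaches, except that the labels carry no explicit phonetic meaning.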
represents a discrete empirical distribution of phonotactic counts. An L2 inner product kernel [18] is given by

    κ(x, y) = x^T y = Σ_{i=1}^{B} x_i y_i    (8)

which measures the similarity between two bag-of-sounds vectors.

For each target language, an SVM is trained using the bag-of-sounds vectors pertaining to the target language as the positive set, with those from all other languages as the negative set. In this way, N one-versus-the-rest SVM models are built, one for each target language. Given a test utterance, N SVM output scores will be generated for language recognition. Alternatively, the N SVM scores can be used to produce an N-dimensional score vector [82], and thus we are able to project the high-dimension composite feature vectors into a much lower dimension of N. The generated N-dimensional score vectors can be further modeled by a Gaussian back-end for the classification decision [151].

approach, we attempt to model the acoustic–phonetic distribution of a language using the acoustic features.

The early efforts of acoustic–phonetic approaches to language recognition probably began in the 1980s. A polynomial classifier based on 100 features derived from linear predictive coding (LPC) analysis [27] was studied for the recognition of eight languages. A VQ technique with pitch contour and formant frequency features was proposed for the recognition of three languages [38], [42]. A study using multiple language-dependent VQ codebooks and a universal VQ codebook (i.e., general and language independent) was conducted over the LPC-derived features for the recognition of 20 languages [125]. This study suggests that acoustic features are effective in different settings. A comparative study among four modeling techniques, VQ, discrete HMM, continuous-density HMM (CDHMM), and GMM with mel-cepstrum coefficients, was carried out over four languages, showing that the CDHMM and GMM offer a better performance [98].

In this section, we will discuss feature extraction, two modeling techniques, and intersession variability compensation, all of which affect the system performance.

    p(o | λ) = Σ_{i=1}^{M} w_i N(o | μ_i, Σ_i)    (11)
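Returning to the bag-of-sounds representation and the L2 inner product kernel of (8), a minimal sketch with a hypothetical three-token inventory follows. A real system would feed such vectors to the one-versus-the-rest SVMs described above; here we only show the vector construction and the kernel itself:

```python
from collections import Counter
from itertools import product

def bag_of_sounds(tokens, inventory, n=2):
    """Normalized n-gram counts of a token sequence over a fixed basis,
    giving the discrete empirical distribution of phonotactic counts."""
    grams = list(zip(*(tokens[i:] for i in range(n))))
    counts = Counter(grams)
    total = max(len(grams), 1)
    basis = list(product(inventory, repeat=n))  # fixed B-dimensional basis
    return [counts[g] / total for g in basis]

def l2_kernel(x, y):
    """L2 inner product kernel of (8): k(x, y) = x^T y."""
    return sum(a * b for a, b in zip(x, y))

inventory = ["a", "s", "t"]  # hypothetical token set
x = bag_of_sounds(["s", "t", "a", "s", "t"], inventory)
y = bag_of_sounds(["s", "t", "s", "t", "a"], inventory)
z = bag_of_sounds(["a", "a", "a", "s", "a"], inventory)
print(l2_kernel(x, y) > l2_kernel(x, z))  # similar phonotactics score higher
```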
The SVM is typically used to separate vectors in a binary classification problem. It projects an input vector x into a scalar value f(x), as follows:

    f(x) = Σ_{i=1}^{I} α_i κ(x, x_i) + β    (13)

where the vectors x_i are support vectors, I is the number of support vectors, α_i are the weights, and β is a bias. The weights imposed on the support vectors are constrained such that Σ_{i=1}^{I} α_i = 0 and α_i ≠ 0. The function κ(·, ·) is the kernel, subject to certain properties (the Mercer condition), so that it can be expressed as

    κ(x, y) = φ(x)^T φ(y)    (14)

where φ(x) is a mapping from the input space to a possibly infinite-dimensional space. Now, the question is: How do we represent a speech utterance, and thus a language, in a high-dimensional vector space? We will discuss two different vector space modeling techniques that characterize a speech utterance with spectral features and GMM parameters, respectively.

After acoustic feature extraction, a speech utterance has become a sequence of spectral feature vectors. To compare two speech utterances with the SVM, there needs to be a way of taking two such sequences of feature vectors, calculating a sequence kernel operation, and computing the SVM output. A successful approach using the sequence kernel is the generalized linear discriminative sequence (GLDS) kernel [16]. It takes the explicit polynomial expansion of the input feature vectors and applies the sequence kernel based on generalized linear discriminants. The polynomial expansion includes all the monomials of the features in a feature vector.

As an example, for two features in the input feature vector o = [o_1, o_2]^T, the monomials with degree up to two are b(o) = [1, o_1, o_2, o_1^2, o_1 o_2, o_2^2]^T. Let O_X and O_Y be two input sequences of feature vectors. The polynomial discriminant can be obtained using the mean-squared error criterion [16]. The resulting sequence kernel can be expressed as follows:

    κ(O_X, O_Y) = b̄_X^T R^{-1} b̄_Y    (15)

where the vector

    b̄ = (1/T) Σ_{t=1}^{T} b(o(t))    (16)

is the average expansion over all the feature vectors of the input utterance O = {o(1), o(2), ..., o(T)}. The matrix R is the correlation matrix obtained from a background data set. An approximation of R can be applied to calculate only the diagonal terms [16].

While we consider the GLDS kernel a nonparametric approach to the language recognition problem, the GMM supervector offers a parametric alternative. Given a UBM and a speech utterance drawn from a language, we derive a GMM supervector m = [μ_1^T, μ_2^T, ..., μ_M^T]^T by stacking the mean vectors of all the adapted mixture components [18]. In this way, a speech utterance is mapped to a high-dimensional space using the mean parameters of a GMM. It should be noted that the MAP adaptation is performed for each and every utterance available for a particular language. For two speech utterances O_X and O_Y, two GMMs can be derived using MAP adaptation from the UBM to obtain two supervectors m_X = [μ_1^X, μ_2^X, ..., μ_M^X]^T and m_Y = [μ_1^Y, μ_2^Y, ..., μ_M^Y]^T. There are a number of ways to compare two speech utterances in terms of the derived mean vectors. A natural choice is the Kullback–Leibler (KL) divergence, which can be approximated by the following upper bound [18]:

    d(m_X, m_Y) = (1/2) Σ_{i=1}^{M} w_i (μ_i^X − μ_i^Y)^T Σ_i^{-1} (μ_i^X − μ_i^Y).    (17)

We can then find the corresponding inner product that serves as a kernel function, i.e., the so-called KL kernel [18], as follows:

    κ_KL(O_X, O_Y) = Σ_{i=1}^{M} (√w_i Σ_i^{-1/2} μ_i^X)^T (√w_i Σ_i^{-1/2} μ_i^Y).    (18)

Note that the KL kernel only accommodates the adaptation of the GMM mean vectors while leaving the covariance matrices unchanged. A Bhattacharyya kernel [145] was proposed that allows for the adaptation of covariance matrices, showing an improved performance. Similar to the KL kernel, the Bhattacharyya kernel represents speech utterances as the adapted mean vectors of GMMs

    κ_BHATT(O_X, O_Y) = Σ_{i=1}^{M} [((Σ_i^X + Σ̃_i)/2)^{-1/2} μ_i^X]^T [((Σ_i^Y + Σ̃_i)/2)^{-1/2} μ_i^Y].    (19)
The difference lies in the adapted covariance matrices Σ_i^X and Σ_i^Y, which appear as normalization factors for the adapted mean vectors [145]. Here, μ̃_i and Σ̃_i are, respectively, the mean vectors and covariance matrices of the UBM. In [19], the KL kernel was extended to include the covariance matrices in the supervector. An interesting interaction between the GMM supervector and the SVM is that the SVM parameters can be pushed back to GMM models, which allows for more effective scoring.

Finally, for each target language, a one-versus-the-rest SVM can be trained in the vector space with the target language being the positive set and all other competing languages being the negative set. A decision strategy can be devised to summarize the outputs of the multiple one-versus-the-rest SVMs during the runtime test, as suggested in Section III-C.

D. Intersession Variability
In language recognition, we face several sources of nonlanguage variability, such as speaker, gender, channel, and environment, in the speech signals. For simplicity, we refer to all such variabilities as intersession variability, that is, the variability exhibited by a given language from one recording session to another. The class of subspace methods for model compensation, such as joint factor analysis (JFA) [21], [57], [140] and nuisance attribute projection (NAP) [124], which originated from speaker recognition research, is useful to address the variability issues. The idea of the subspace techniques is to model and eliminate the nuisance subspace pertaining to intersession variability, thereby reducing the mismatch between training and test. Recent efforts have brought the subspace methods from the model domain to the feature domain, such as feature-level latent factor analysis (fLFA) [56], [139] and feature-level NAP (fNAP) [20], which have shown superior performance in language recognition. Feature-domain compensation is not tied to a specific model assumption; thus, it could be used for a wider range of modeling schemes. We explain the working principle of fLFA as follows.

We assume that the GMM supervector m, derived for a speech utterance, can be decomposed into the sum of two supervectors

    m = m̃ + Uz    (20)

where m̃ is the UBM mean supervector, U is a low-rank matrix projecting the latent factor subspace into the supervector model domain, and z is a low-dimensional vector holding the latent factors for the current speech utterance and language [139]. Matrix U can be estimated using an […] based on MAP estimation [80]. In order to conduct feature-domain compensation, (20) can be rewritten as

    μ_i = μ̃_i + U_i z    (21)

where μ_i and μ̃_i are the mean vectors of the ith Gaussian component of the GMM and the UBM, respectively. Submatrix U_i is the intersession compensation offset related to the ith Gaussian component. The compensation of the feature vector o(t) is obtained by subtracting a weighted sum of the intersession compensation bias vectors

    õ(t) = o(t) − Σ_{i=1}^{M} γ_i U_i z    (22)

where γ_i is the Gaussian occupation probability.

Most recently, another factor analysis technique, the i-vector [34], has become very popular in speaker recognition and has also been introduced to language recognition [35]. The i-vector paradigm provides an efficient way to compress GMM supervectors by confining all sorts of variabilities (both language and nonlanguage) to a low-dimensional subspace, referred to as the total variability space. The generative equation is given by

    m = m̃ + Tw    (23)

where the matrix T is the so-called total variability matrix. The latent variable w is taken to be a low-dimensional random vector with a standard normal distribution. For a speech utterance O, its i-vector is given by the posterior mean of the latent variable w, i.e., E{w | O}. Since T is always a low-rank rectangular matrix, the dimensionality of the i-vector is much smaller compared to that of the supervector m.

One subtle difference between the total variability space T and the session variability space U, as given in (20), is that the former captures both the intersession variability as well as the intrinsic variability between languages, which is not represented in the latter. As such, matrix U is used to remove the unwanted intersession variability, whereas the purpose of matrix T is to preserve the important variability in a much more compact form than the original supervector. Compensation techniques are then applied on the i-vectors to cope with the intersession variability. To this end, linear discriminant analysis has been shown to be effective [35]. It is worth
EM training based on the principal component analysis noting that, though labeled data are not required for
[126], and the latent factors z are estimated for each ses- training the total variability subspace pertaining to the
sion using the probabilistic subspace adaptation method i-vectors, the subsequent compensation techniques can
only be effective when training data for the intended share a common covariance matrix, we form what is called
intersession variability are available. the linear back–end, as follows:
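Under the stated assumption that the language-dependent Gaussians share a common covariance matrix, scoring becomes linear in the score vector. A minimal sketch of such a shared-covariance (linear) back–end, with all function and variable names illustrative:

```python
import numpy as np

def train_linear_backend(score_vecs, labels, n_classes):
    """Estimate per-class means and a pooled (shared) covariance from
    development score vectors, then return linear scoring parameters
    (W, b) such that s_l = w_l^T x + b_l for each class l."""
    d = score_vecs.shape[1]
    means = np.stack([score_vecs[labels == l].mean(axis=0)
                      for l in range(n_classes)])
    # Pooled within-class covariance, shared by all classes.
    centered = score_vecs - means[labels]
    cov = centered.T @ centered / len(score_vecs)
    cov += 1e-6 * np.eye(d)              # regularize for stability
    prec = np.linalg.inv(cov)
    W = means @ prec                     # one weight row per class
    b = -0.5 * np.einsum('ld,ld->l', W, means)
    return W, b

def backend_scores(x, W, b):
    """Linear back-end scores for one input score vector x."""
    return W @ x + b
```

With equal class priors, the language with the largest linear score is selected; the same back-end scores can also feed a subsequent fusion and calibration stage.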
Gaussian model using score vectors for the (N+1)th language, a new target language, as long as we are given some developmental data from this language. This is also particularly useful to model out-of-set languages. To this end, the score vectors associated with nontarget languages are pooled and used to train a Gaussian model. Out-of-set languages are then treated as an additional class representing the none-of-the-above hypothesis.

B. Multiclass Logistic Regression

Another way of combining likelihood scores from multiple subsystems is the product rule [104], [150], or equivalently, the sum of log-likelihoods as follows:

s'_l = Σ_{k=1}^{K} log S(O|λ_{k,l}) = Σ_{k=1}^{K} s(k, l)    (25)

for l = 1, ..., N. The underlying assumption in (25) is that the output scores S(O|λ_{k,l}) from the models λ_{k,l} can be interpreted as likelihood measures and that the subsystems are independent of each other. No training is required, as the log-likelihoods are essentially summed with equal weights. A more general form could be obtained by introducing additional control parameters as follows:

s'_l = Σ_{k=1}^{K} α_k s(k, l) + β_l    (26)

for l = 1, ..., N. Note that the weights α_k assigned to the subsystems are used across all target languages, while the biases β_l are made dependent on the targets. The fusion function is linear with respect to the log-likelihood scores. The weights and biases could be found by an exhaustive grid search optimizing some application-dependent cost function [74]. A more systematic way is to use a multiclass logistic regression model [43] for the class posterior

P(L_l | s'_l) = exp(s'_l) / Σ_{i=1}^{N} exp(s'_i)    (27)

and to find the parameters that maximize the log-posterior probability on the development data, as follows [8], [10], [12], [134]:

Q(α_1, ..., α_K, β_1, ..., β_N) = Σ_{m=1}^{M} Σ_{l=1}^{N} m_l log P(L_l | s'_l).    (28)

In the above equation, M is the number of training samples and m_l is the training label, i.e.,

m_l = { w_l,  if labeled as language L_l
      { 0,    otherwise.    (29)

Here, w_l for l = 1, 2, ..., N are weighting factors used to normalize the class proportion in the training data. If there are an equal number of training samples for each target language (or normalization is not intended), then we set w_l = 1. With w_l = M/(M_l N), we normalize out the effect of the class population M_l from the cost function, where N is the number of target languages. No closed-form solution exists to maximize (28). A conjugate gradient–descent optimization technique was found to be the most efficient among several other gradient-based numerical optimization techniques [10], [93].

Notice that calibration is achieved jointly with score fusion via (26)–(29), where subsystem scores are combined in the first term on the right-hand side of (26), while score calibration is achieved by scaling and shifting of the input scores. When K = 1, (26) reduces to a score calibration device. Such a form of calibration is referred to as the application-independent calibration in [8], [9], and [135], the purpose of which is to allow a threshold to be set on the calibrated scores based on the cost and prior (i.e., the application parameters). Setting the threshold is crucial especially for language detection and open-set language identification tasks. The rationale behind this, as given in [12], is as follows. To make a decision with uncalibrated scores, one needs to probe the subsystem to understand the behavior of its output scores and, thus, to learn where to put the threshold. In contrast, given well-calibrated scores, one can calculate the risk expectation in a straightforward way and, thus, make minimum-risk decisions with no previous experience of the subsystem.

C. Identification and Verification

In a similar way as humans perceive the problem, it is intuitive to see automatic language recognition as an identification task. Given a spoken utterance, we match it against a predefined set of target languages and make a decision regarding the identity of the language spoken in the segment of speech. Another closely related recognition problem is language verification or detection. Here, we are given a speech segment and the task is to decide between two hypotheses: whether or not the speech segment is from a target (or claimed) language.

Language recognition could be best manifested as a multiclass recognition problem where the input is known to belong to one of a set of discrete classes. The objective is to recognize the class of the input. In what follows, we adopt the formulation proposed in [8] and [10] to further illustrate the problem.
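The fusion-and-calibration scheme of (26)–(29) can be sketched in a few lines. The plain gradient-ascent optimizer below stands in for the conjugate-gradient technique cited above, and all names are illustrative:

```python
import numpy as np

def fuse(S, alpha, beta):
    """Fused scores (26): s'_l = sum_k alpha_k * s(k, l) + beta_l.
    S has shape (K, N): K subsystems, N target languages."""
    return alpha @ S + beta

def posteriors(s):
    """Class posteriors (27): softmax over the fused scores."""
    e = np.exp(s - s.max())
    return e / e.sum()

def train_mclr(scores, labels, n_iter=200, lr=0.5):
    """Maximize the weighted log-posterior (28) by gradient ascent.
    scores: list of (K, N) score matrices; labels: true language indices.
    Class weights w_l = M / (M_l * N), as in (29)."""
    K, N = scores[0].shape
    M = len(scores)
    counts = np.bincount(labels, minlength=N)
    w = M / (np.maximum(counts, 1) * N)
    alpha, beta = np.ones(K), np.zeros(N)
    for _ in range(n_iter):
        g_a, g_b = np.zeros(K), np.zeros(N)
        for S, l in zip(scores, labels):
            p = posteriors(fuse(S, alpha, beta))
            err = -p
            err[l] += 1.0      # d log P(L_l) / d s'_j = 1{j=l} - p_j
            g_a += w[l] * (S @ err)
            g_b += w[l] * err
        alpha += lr * g_a / M
        beta += lr * g_b / M
    return alpha, beta
```

Setting K = 1 turns the same routine into the score calibration device described above: a single scale and per-language offsets fitted on development data.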
We are given a list of N target classes, {L_1, L_2, ..., L_N}, for each of which a prior probability is given (a flat prior could be assumed in a general application-independent setting). In the closed-set scenario, {L_1, L_2, ..., L_N} represents N different explicitly specified languages. In the open-set case, L_1, L_2, ..., L_{N-1} are explicitly specified languages, and L_N denotes any of the yet unseen out-of-set languages. Such an open-set scenario could be handled by introducing an artificial "none-of-the-above" class into the target set, for instance, at the Gaussian back–end. Given a spoken utterance O and the set of target languages, we have the following different, but closely related, tasks [8], [10].
• Language identification: Which of the N languages does O belong to?
• Language verification: Does O belong to language L_l or to other languages (i.e., one of the other N - 1 languages)?

For the identification task, we compute the likelihood given each language and select the language hypothesis that yields the highest likelihood. The language verification or detection task is a binary classification problem, where a decision has to be made between two hypotheses with respect to a decision threshold. In Section VI, we look into more details regarding threshold setting and performance assessment following the language detection protocol as established in NIST LREs.

VI. THE NIST LRE PARADIGM

A. Corpora for LREs

The availability of sufficiently large corpora has been the major driving factor in the development of speech technology in recent decades [25], [26], [63], [76], [87], [94]. To model a spoken language, we need a set of speech data for the language. To account for the within-language variability, which we also call intersession variability, such as speaker, content, recording device, communication channel, and background noise, it is desirable to have sufficient data that include the intended intersession effects.

The first large-scale speech data collection for the purpose of language recognition research was carried out by OGI in the early 1990s. Their efforts resulted in the OGI-11L and OGI-22L corpora [63], [94], which are now distributed through the Linguistic Data Consortium (LDC). As its name implies, the OGI-11L is a multilanguage corpus of 11 languages. The number of languages was expanded to 22 in OGI-22L. Similar data collections were organized by LDC, leading to the CallHome and CallFriend corpora, consisting of 6 and 12 languages, respectively. The aforementioned corpora were collected over the telephone network, which reaches out to speakers of different languages easily over a wide geographical area [94]. Both OGI-11L and OGI-22L are monologue speech corpora, where the callers answer questions prompted by a machine. The later collection of the CallHome, CallFriend, and Mixer [25] corpora was motivated by a more challenging task aiming at conversational speech using dialog between individuals.

Another drive that has contributed to advances in speech research is the effort toward standard protocols for performance evaluation. The paradigm of formal evaluation was established by NIST in response to this need by providing the research community with a number of essential elements, such as manually labeled data sets, well-defined tasks, evaluation metrics, and a postevaluation workshop [87].

LREs were conducted by NIST in 1996, 2003, 2005, 2007, 2009, and 2011. It is evident that more languages are being included from year to year, as can be observed in Table 2, which summarizes the language profile. Another highlight is that NIST LRE has set out to focus on a detection task since its inception, as both closed-set and open-set identification problems can be easily formulated as applications of the detection task. Furthermore, performance measures such as identification accuracy, which depends on the number of target languages, could be factored out in the detection task.

Table 2  Languages Included in the NIST Series of LRE. Target Languages Are Marked as "X" While Out-of-Set Languages Are Marked as "O"

NIST LREs have also sought to examine dialect detection capabilities, besides language detection. While the term dialect may have different linguistic definitions, here it means a variety of a language that is characteristic of speakers from a specific geographical region. For instance, the dialects of interest include variants of Chinese, English, Hindustani, and Spanish in LRE 2007, as shown in Table 2. Dialect detection is generally believed to be more challenging [6], [79] than language detection, as the precise boundaries between dialects are not always clearly defined. In addition to the specified target languages (and dialects), several out-of-set languages (marked as "O") have also been included.

The emphasis of NIST LREs has been on conversational telephone speech (CTS), since most of the likely applications of the technology involve signals recorded from the public telephone system [87]. In order to collect speech data for more languages in a cost-effective way, NIST has adopted broadcast narrowband speech (BNBS) lately, in LRE 2009 and LRE 2011 [26]. BNBS data are excerpts of call-in telephone speech embedded in broadcast and webcast. The call-in excerpts are used as we could expect them to cover as many speakers as possible. Broadcast entities like the Voice of America (VOA) broadcast in more than 45 languages, and the British Broadcasting Corporation (BBC) also produces and distributes programs in a large number of languages. They have become an invaluable source of multilingual speech data.

B. Detection Cost and Detection Error Tradeoff (DET) Curve

In the detection task as defined in NIST LREs, system performance is evaluated by presenting the system with a set of trials, each consisting of a test segment and a hypothesized target language. The system has to decide for each trial whether the target language was spoken in the given segment.
Let N_T be the number of test segments and N be the number of target languages as defined earlier. By presenting each test segment against all target languages, there are N_T trials for each target, and the system under evaluation should produce N × N_T true or false decisions, one for each trial. The primary evaluation measure is the average detection cost, defined as follows:

C_avg = (1/N) Σ_{l=1}^{N} C_DET(L_l)    (30)

Expanding the per-language detection cost C_DET(L_l) in terms of its miss and false alarm probabilities gives

C_avg = (1/N) Σ_{l=1}^{N} { C_miss P_tar P_miss(L_l) + C_fa (1 - P_tar) [ (1/(N-1)) Σ_{m≠l} P_fa(L_l, L_m) ] }    (32)

where C_miss and C_fa are the costs of a miss and a false alarm, P_tar is the prior probability of a target trial, P_miss(L_l) is the miss probability for target language L_l, and P_fa(L_l, L_m) is the probability of falsely accepting a segment of language L_m as language L_l. The average of the first term over the targets is the average miss probability P_miss(DET), and that of the second term is the average false alarm probability P_fa(DET).
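The average detection cost of (30) and (32) can be computed directly from hard trial decisions. A minimal sketch, using C_miss = C_fa = 1 and P_tar = 0.5 as illustrative application parameters:

```python
import numpy as np

def average_cost(decisions, seg_lang, c_miss=1.0, c_fa=1.0, p_tar=0.5):
    """Average detection cost C_avg of (32).
    decisions[t, l] is True if the system accepts target language l
    for test segment t; seg_lang[t] is the true language index."""
    n_seg, n_lang = decisions.shape
    p_miss = np.zeros(n_lang)            # P_miss(L_l)
    p_fa = np.zeros((n_lang, n_lang))    # P_fa(L_l, L_m)
    for l in range(n_lang):
        tgt = seg_lang == l
        p_miss[l] = np.mean(~decisions[tgt, l]) if tgt.any() else 0.0
        for m in range(n_lang):
            if m == l:
                continue
            non = seg_lang == m
            p_fa[l, m] = np.mean(decisions[non, l]) if non.any() else 0.0
    avg_fa = p_fa.sum(axis=1) / (n_lang - 1)  # inner average over m != l
    return np.mean(c_miss * p_tar * p_miss + c_fa * (1 - p_tar) * avg_fa)
```

Note that a system that rejects every trial (all misses) and one that accepts every trial (all false alarms) both incur a cost of 0.5 under these parameters, so a useful system must score well below that.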
speech. As shown in Section II-A, perfect phone transcription will provide distinctive phonotactic statistics for languages. An obvious option is to improve the phone recognizers in the tokenization front–end with a better modeling technique, such as the left-context–right-context FeatureNet [90] and deep neural networks [50], [78]. Nonetheless, we also realize that we will never be able to achieve perfect phone transcription. An alternative is to find a better way of extracting the phone n-gram statistics using existing phone recognizers, while continuing the efforts to improve phone recognition accuracy. Several attempts have shown positive results along this direction. The lattice-based phone n-gram statistics [40] make use of posterior probabilities as soft counts, as opposed to hard phone n-gram counts. The target-oriented phone tokenizer [128] and target-aware lattice rescoring [129] suggest different ways to make use of the same phone recognizer to create diverse statistics from the same source. The cross-decoder (or cross-recognizer) phone co-occurrence n-gram [103], for the first time, explores phonotactic information across multiple phone sequences from parallel phone recognizers. These studies have shown that it is worth the effort to explore phonotactic information from different perspectives.

In acoustic approaches, the latent factor modeling technique, as used in JFA [57], [140] and the i-vector [34], is unleashing its potential in language recognition [35]. We envisage its use for hierarchical modeling of languages. In particular, a latent factor model could be used to capture the majority of useful variability among languages from a distinct language cluster, using separate subspaces for each language cluster. Discrimination between languages can then be done locally in each subspace corresponding to a language cluster. From a modeling perspective, the idea can be seen as an extension to the i-vector approach (see Section IV-D), where we use multiple subspaces, instead of one total variability space, each corresponding to a language cluster.

In system development, we have shown the usefulness of the Gaussian back–end for decision fusion and calibration. One subtle problem that a system designer may encounter is the duration-dependent nature of the Gaussian back–end. The score vectors produced by the front–end recognizer exhibit larger variation for test segments with longer durations. This problem is traditionally dealt with by having a separate back–end for each of the nominal durations the system designers could foresee. In [91], a simple parametric function is used to scale the log-likelihood scores so that a single Gaussian back–end can be used for all durations. A more elegant, but challenging, way could be to model the covariance matrix directly as a function of the test segment duration, so that the covariance of the Gaussian back–end could be updated on the fly according to the test duration.

The need for fast, efficient, accurate, and robust means of language recognition is of growing importance for commercial, forensic, and government applications. The Speaker and Language Characterization (SpLC) special interest group of the International Speech Communication Association (ISCA) started the Odyssey Speaker and Language Recognition Workshop in 1994 as a forum to foster interactions among researchers and to promote language recognition research. Other scientific forums include the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) and the Annual Conference of the International Speech Communication Association (INTERSPEECH), where the latest findings are reported.
[16] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo, "Support vector machines for speaker and language recognition," Comput. Speech Lang., vol. 20, no. 2–3, pp. 210–229, 2006.
[17] W. Campbell, T. Gleason, J. Navratil, D. Reynolds, W. Shen, E. Singer, and P. Torres-Carrasquillo, "Advanced language recognition using cepstra and phonotactics: MITLL system performance on the NIST 2005 language recognition evaluation," in Proc. IEEE Odyssey: Speaker Lang. Recognit. Workshop, San Juan, Puerto Rico, 2006, DOI: 10.1109/ODYSSEY.2006.248097.
[18] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Process. Lett., vol. 13, no. 5, pp. 308–310, May 2006.
[19] W. M. Campbell, "A covariance kernel for SVM language recognition," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Las Vegas, NV, USA, 2008, pp. 4141–4144.
[20] W. M. Campbell, D. E. Sturim, P. Torres-Carrasquillo, and D. A. Reynolds, "A comparison of subspace feature-domain methods for language recognition," in Proc. Interspeech Conf., Brisbane, Australia, 2008, pp. 309–312.
[21] F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, and C. Vair, "Compensation of nuisance factors for speaker and language recognition," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 7, pp. 1969–1978, Sep. 2007.
[22] W. B. Cavnar and J. M. Trenkle, "N-gram-based text categorization," in Proc. 3rd Annu. Symp. Document Anal. Inf. Retrieval, Las Vegas, NV, USA, 1994, pp. 161–175.
[23] C. Chelba, T. Hazen, and M. Saraclar, "Retrieval and browsing of spoken content," IEEE Signal Process. Mag., vol. 25, no. 3, pp. 39–49, May 2008.
[24] J. Chu-Carrol and B. Carpenter, "Vector-based natural language call routing," Comput. Linguist., vol. 25, no. 3, pp. 361–388, 1999.
[25] C. Cieri, J. P. Campbell, H. Nakasone, D. Miller, and K. Walker, "The Mixer corpus of multilingual, multichannel speaker recognition data," in Proc. Int. Conf. Lang. Resources Eval., Lisbon, Portugal, 2004, pp. 24–30.
[26] C. Cieri, L. Brandschain, A. Neely, D. Graff, K. Walker, C. Caruso, A. Martin, and C. Greenberg, "The broadcast narrow band speech corpus: A new resource type for large scale language recognition," in Proc. Interspeech Conf., Brighton, U.K., 2009, pp. 2867–2870.
[27] D. Cimarusti and R. Ives, "Development of an automatic identification system of spoken languages: Phase I," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Paris, France, 1982, pp. 1661–1663.
[28] H. P. Combrinck and E. C. Botha, "Automatic language identification: Performance vs. complexity," in Proc. 8th Annu. South Africa Workshop Pattern Recognit., Grahams Town, South Africa, 1997.
[29] B. Comrie, The World's Major Languages. Oxford, U.K.: Oxford Univ. Press, 1990.
[30] D. Crystal, The Cambridge Factfinder. Cambridge, U.K.: Cambridge Univ. Press, 1993.
[31] S. Cumani and P. Laface, "Analysis of large-scale SVM training algorithms for language and speaker recognition," IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 5, pp. 1585–1596, Jul. 2012.
[32] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust. Speech Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.
[33] N. Dehak, P. Dumouchel, and P. Kenny, "Modeling prosodic features with joint factor analysis for speaker verification," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 7, pp. 2095–2103, Sep. 2007.
[34] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 4, pp. 788–798, May 2011.
[35] N. Dehak, P. Torres-Carrasquillo, D. Reynolds, and R. Dehak, "Language recognition via i-vectors and dimensionality reduction," in Proc. Interspeech Conf., Florence, Italy, 2011, pp. 857–860.
[36] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc., vol. 39, pp. 1–38, 1977.
[37] T. Dunning, "Statistical identification of language," Comput. Res. Lab (CRL), New Mexico State Univ., Las Cruces, NM, USA, Tech. Rep. MCCS-94-273, 1994.
[38] J. Foil, "Language identification using noisy speech," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Tokyo, Japan, 1986, pp. 861–864.
[39] J. L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291–298, Apr. 1994.
[40] J. L. Gauvain, A. Messaoudi, and H. Schwenk, "Language recognition using phone lattices," in Proc. Int. Conf. Spoken Lang. Process., Jeju Island, Korea, 2004, pp. 1283–1286.
[41] E. M. Gold, "Language identification in the limit," Inf. Control, vol. 10, no. 5, pp. 447–474, 1967.
[42] F. J. Goodman, A. F. Martin, and R. E. Wohlford, "Improved automatic language identification in noisy speech," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Glasgow, Scotland, 1989, vol. 1, pp. 528–531.
[43] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY, USA: Springer-Verlag, 2008.
[44] T. J. Hazen and V. W. Zue, "Automatic language identification using a segment-based approach," in Proc. Eurospeech Conf., Berlin, Germany, 1993, pp. 1303–1306.
[45] T. J. Hazen and V. W. Zue, "Recent improvements in an approach to segment-based automatic language identification," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Adelaide, Australia, 1994, pp. 1883–1886.
[46] T. J. Hazen and V. W. Zue, "Segment-based automatic language identification," J. Acoust. Soc. Amer., vol. 101, no. 4, pp. 2323–2331, 1997.
[47] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, Oct. 1994.
[48] J. Hieronymus and S. Kadambe, "Spoken language identification using large vocabulary speech recognition," in Proc. Int. Conf. Spoken Lang. Process., Philadelphia, PA, USA, 1996, pp. 1780–1783.
[49] G. Hinton, "Products of experts," in Proc. 9th Int. Conf. Artif. Neural Netw., Edinburgh, U.K., 1999, vol. 1, pp. 1–6.
[50] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[51] A. S. House and E. P. Neuburg, "Toward automatic identification of the language of an utterance. I. Preliminary methodological considerations," J. Acoust. Soc. Amer., vol. 62, no. 3, pp. 708–713, 1977.
[52] X. Huang, A. Acero, and H. W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Englewood Cliffs, NJ, USA: Prentice-Hall, 2001.
[53] M. I. Jordan and R. A. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," Neural Comput., vol. 6, pp. 181–214, 1994.
[54] B. H. Juang, W. Chou, and C.-H. Lee, "Minimum classification error rate methods for speech recognition," IEEE Trans. Speech Audio Process., vol. 5, no. 3, pp. 257–265, May 1997.
[55] D. Jurafsky and J. H. Martin, Speech and Language Processing. Englewood Cliffs, NJ, USA: Prentice-Hall, 2000.
[56] P. Kenny and P. Dumouchel, "Experiments in speaker verification using factor analysis likelihood ratios," in Proc. Odyssey: Speaker Lang. Recognit. Workshop, Toledo, Spain, 2004, pp. 219–226.
[57] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 4, pp. 1435–1447, May 2007.
[58] K. Kirchhoff, "Language characteristics," in Multilingual Speech Processing, T. Schultz and K. Kirchhoff, Eds. Amsterdam, The Netherlands: Elsevier, 2006.
[59] J. Kittler, "Combining classifiers: A theoretical framework," Pattern Anal. Appl., no. 1, pp. 18–27, 1988.
[60] M. Kockmann, L. Ferrer, L. Burget, and J. Černocký, "iVector fusion of prosodic and cepstral features for speaker verification," in Proc. Interspeech Conf., Florence, Italy, 2011, pp. 265–268.
[61] L. I. Kuncheva, "A theoretical study on six classifier fusion strategies," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 281–286, Feb. 2002.
[62] L. F. Lamel and J. L. Gauvain, "Language identification using phone-based acoustic likelihoods," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Adelaide, Australia, 1994, vol. 1, pp. 293–296.
[63] T. Lander, R. Cole, B. Oshika, and M. Noel, "The OGI 22 language telephone speech corpus," in Proc. Eurospeech Conf., Madrid, Spain, 1995, pp. 817–820.
[64] C.-H. Lee, "Principles of spoken language recognition," in Springer Handbook of Speech Processing and Speech Communication, J. Benesty, M. M. Sondhi, and A. Huang, Eds. New York, NY, USA: Springer-Verlag, 2008.
[65] K. A. Lee, C. You, and H. Li, "Spoken language recognition using support vector machines with generative front-end," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Las Vegas, NV, USA, 2008, pp. 4153–4156.
[66] K. A. Lee, C. H. You, H. Li, T. Kinnunen, and K. C. Sim, "Using discrete probabilities with Bhattacharyya measure for SVM-based speaker verification," IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 4, pp. 861–870, May 2011.
[67] K. A. Lee, C. H. You, V. Hautamäki, A. Larcher, and H. Li, "Spoken language recognition in the latent topic simplex," in Proc. Interspeech Conf., Florence, Italy, 2011, pp. 2933–2936.
[68] L. Lee and R. C. Rose, "Speaker normalization using efficient frequency warping procedures," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Atlanta, GA, USA, 1996, vol. 1, pp. 353–356.
[69] R. G. Leonard and G. R. Doddington, "Automatic language identification," Air Force Rome Air Develop. Cntr., Tech. Rep. RADC-TR-74-200, Aug. 1974.
[70] R. G. Leonard and G. R. Doddington, "Automatic language identification," Air Force Rome Air Develop. Cntr., Tech. Rep. RADC-TR-75-264, Oct. 1975.
[71] M. P. Lewis, Ed., Ethnologue: Languages of the World, 16th ed. Dallas, TX, USA: SIL International, 2009.
[72] H. Li and B. Ma, "A phonotactic language model for spoken language identification," in Proc. Assoc. Comput. Linguist., Ann Arbor, MI, USA, 2005, pp. 515–522.
[73] H. Li, B. Ma, and C.-H. Lee, "A vector space modeling approach to spoken language identification," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 1, pp. 271–284, Jan. 2007.
[74] H. Li, B. Ma, K. C. Sim, R. Tong, K. A. Lee, H. Sun, D. Zhu, C. You, M. Dong, and X. Wang, "Institute for Infocomm Research system description for the language recognition evaluation 2007 submission," in Proc. NIST Lang. Recognit. Eval. Workshop, Orlando, FL, USA, 2007.
[75] H. Li, B. Ma, and C.-H. Lee, "Vector-based spoken language classification," in Springer Handbook of Speech Processing and Speech Communication, J. Benesty, M. M. Sondhi, and A. Huang, Eds. New York, NY, USA: Springer-Verlag, 2008.
[76] H. Li and B. Ma, "TechWare: Speaker and spoken language recognition resources,"
[79] B. P. Lim, H. Li, and B. Ma, "Using local and global phonotactic features in Chinese dialect identification," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Philadelphia, PA, USA, 2005, pp. 577–580.
[80] S. Lucey and T. Chen, "Improved speaker verification through probabilistic subspace adaptation," in Proc. Eurospeech Conf., Geneva, Switzerland, 2003, pp. 2021–2024.
[81] B. Ma, C. Guan, H. Li, and C.-H. Lee, "Multilingual speech recognition with language identification," in Proc. Int. Conf. Spoken Lang. Process., Denver, CO, USA, 2002, pp. 505–508.
[82] B. Ma, H. Li, and R. Tong, "Spoken language recognition with ensemble classifiers," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 7, pp. 2053–2062, Sep. 2007.
[83] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2008.
[84] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proc. Eurospeech Conf., Rhodes, Greece, 1997, vol. 4, pp. 1895–1898.
[85] A. F. Martin and M. A. Przybocki, "NIST 2003 language recognition evaluation," in Proc. Eurospeech Conf., Geneva, Switzerland, 2003, pp. 1341–1344.
[86] A. F. Martin and A. N. Le, "The current state of language recognition: NIST 2005 evaluation results," in Proc. IEEE Odyssey: Speaker Lang. Recognit. Workshop, San Juan, Puerto Rico, 2006, DOI: 10.1109/ODYSSEY.2006.248104.
[87] A. F. Martin and J. S. Garofolo, "NIST speech processing evaluations: LVCSR, speaker recognition, language recognition," in Proc. IEEE Workshop Signal Process. Appl. Public Security Forensics, Washington, DC, USA, 2007, pp. 1–7.
[88] A. F. Martin and A. N. Le, "NIST 2007 language recognition evaluation," presented at the Odyssey: Speaker Lang. Recognit. Workshop, Stellenbosch, South Africa, 2008, paper 016.
[89] A. F. Martin and C. Greenberg, "The 2009 NIST language recognition evaluation," in Proc. Odyssey: Speaker Lang. Recognit. Workshop, Brno, Czech Republic, 2010, pp. 165–171.
[90] P. Matejka, P. Schwarz, J. Cernocky, and P. Chytil, "Phonotactic language identification using high quality phoneme recognition," in Proc. Interspeech Conf., Lisbon, Portugal, 2005, pp. 2237–2240.
[91] A. McCree, F. Richardson, E. Singer, and D. Reynolds, "Beyond frame independence: Parametric modeling of time duration in speaker and language recognition," in Proc. Interspeech Conf., Brisbane, Australia, 2008, pp. 767–770.
[92] S. Mendoza, L. Gillick, Y. Ito, S. Lowe,
[94] Y. K. Muthusamy, R. A. Cole, and B. T. Oshika, "The OGI multi-language telephone speech corpus," in Proc. Int. Conf. Spoken Lang. Process., Banff, AB, Canada, 1992, pp. 895–898.
[95] Y. K. Muthusamy, K. M. Berkling, T. Arai, R. A. Cole, and E. Barnard, "A comparison of approaches to automatic language identification using telephone speech," in Proc. Eurospeech Conf., Berlin, Germany, 1993, pp. 1307–1310.
[96] Y. K. Muthusamy, N. Jain, and R. A. Cole, "Perceptual benchmarks for automatic language identification," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Adelaide, Australia, 1994, vol. 1, pp. 333–336.
[97] Y. K. Muthusamy, E. Barnard, and R. A. Cole, "Reviewing automatic language identification," IEEE Signal Process. Mag., vol. 11, no. 4, pp. 33–41, Oct. 1994.
[98] S. Nakagawa, Y. Ueda, and T. Seino, "Speaker-independent, text-independent language identification by HMM," in Proc. Int. Conf. Spoken Lang. Process., Banff, AB, Canada, 1992, pp. 1011–1014.
[99] J. Navratil, "Spoken language recognition: A step toward multilinguality in speech processing," IEEE Trans. Speech Audio Process., vol. 9, no. 6, pp. 678–685, Sep. 2001.
[100] R. W. M. Ng, C.-C. Leung, T. Lee, B. Ma, and H. Li, "Prosodic attribute model for spoken language identification," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Dallas, TX, USA, 2010, pp. 5022–5025.
[101] R. W. M. Ng, C.-C. Leung, V. Hautamäki, T. Lee, B. Ma, and H. Li, "Towards long-range prosodic attribute modeling for language recognition," in Proc. Interspeech Conf., Chiba, Japan, 2010, pp. 1792–1795.
[102] NIST Language Recognition Evaluations. [Online]. Available: http://nist.gov/itl/iad/mig/lre.cfm
[103] M. Penagarikano, A. Varona, L. J. Rodriguez-Fuentes, and G. Bordel, "Improved modeling of cross-decoder phone co-occurrences in SVM-based phonotactic language recognition," IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 8, pp. 2348–2363, Nov. 2011.
[104] S. Pigeon, P. Druyts, and P. Verlinde, "Applying logistic regression to fusion of the NIST'99 1-speaker submissions," Digital Signal Process., vol. 10, pp. 237–248, 2000.
[105] D. Qu and B. Wang, "Discriminative training of GMM for language identification," in Proc. ISCA IEEE Workshop Spontaneous Speech Process. Recognit., Tokyo, Japan, 2003, pp. 67–70.
[106] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Englewood Cliffs, NJ, USA: Prentice-Hall, 2002.
[107] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[108] F. Ramus and J. Mehler, "Language
IEEE Signal Process. Mag., vol. 27, no. 6, and M. Newman, ‘‘Automatic language identification with suprasegmental cues:
pp. 139–142, Nov. 2010. identification using large vocabulary A study based on speech re-synthesis,’’
[77] K. P. Li and T. J. Edwards, ‘‘Statistical continuous speech recognition,’’ in Proc. J. Acoust. Soc. Amer., vol. 105, no. 1,
models for automatic language IEEE Int. Conf. Acoust. Speech Signal pp. 512–521, 1999.
identification,’’ in Proc. IEEE Int. Conf. Process., Atlanta, GA, USA, 1996, vol. 2,
[109] R. Ramus, M. Nespor, and J. Mehler,
Acoust. Speech Signal Process., Denver, pp. 785–788.
‘‘Correlates of linguistic rhythm in the
CO, USA, 1980, pp. 884–887. [93] T. P. Minka, ‘‘Algorithms for speech signal,’’ Cognition, vol. 73, no. 3,
[78] X. Li, L. Deng, and J. Bilmes, ‘‘Machine maximum-likelihood logistic regression,’’ pp. 265–292, 1999.
learning paradigms for speech recognition: Carnegie Mellon Univ., Pittsburgh, PA,
[110] D. A. Reynolds and R. C. Rose, ‘‘Robust
An overview,’’ IEEE Trans. Audio Speech USA, Tech. Rep. 738, 2001.
text-independent speaker identification
Lang. Process., accepted for publication. [94] Y. K. Muthusamy, R. Cole, and B. Oshika, using Gaussian mixture speaker models,’’
‘‘The OGI multi-language telephone
IEEE Trans. Speech Audio Process., vol. 3, Process., Philadelphia, PA, USA, 2005, Workshop, Stellenbosch, South Africa, 2008,
no. 1, pp. 72–83, Jan. 1995. pp. 629–632. paper 012.
[111] D. A. Reynolds, T. F. Quatieri, and [125] M. Sugiyama, ‘‘Automatic language [138] V. Vapnik, The Nature of Statistical
R. B. Dunn, ‘‘Speaker verification using recognition using acoustic features,’’ in Learning Theory. New York, NY, USA:
adapted Gaussian mixture models,’’ Proc. IEEE Int. Conf. Acoust. Speech Springer-Verlag, 1995.
Digital Signal Process., vol. 10, pp. 19–41, Signal Process., Toronto, ON, Canada, [139] C. Vair, D. Colibro, F. Castaldo,
2000. 1991, vol. 2, pp. 813–816. E. Dalmasso, and P. Laface, ‘‘Channel
[112] F. S. Richardson and W. M. Campbell, [126] M. E. Tipping and C. M. Bishop, ‘‘Mixtures factors compensation in model and feature
‘‘Language recognition with discriminative of probabilistic principal component domain for speaker recognition,’’ in Proc.
keyword selection,’’ in Proc. IEEE Int. Conf. analysis,’’ Neural Comput., vol. 11, no. 2, IEEE Odyssey: Speaker Lang. Recognit.
Acoust. Speech Signal Process., Las Vegas, pp. 443–482, 1999. Workshop, San Juan, Puerto Rico, 2006,
NV, USA, 2008, pp. 4145–4148. [127] R. Tong, B. Ma, D. Zhu, H. Li, and DOI: 10.1109/ODYSSEY.2006.248117.
[113] G. Salton, The SMART Retrieval System. E.-S. Chng, ‘‘Integrating acoustic, prosodic [140] R. Vogt and S. Sridharan, ‘‘Explicit
Englewood Cliffs, NJ, USA: Prentice-Hall, and phonotactic features for spoken language modelling of session variability for
1971. identification,’’ in Proc. IEEE Int. Conf. speaker verification,’’ Comput. Speech
[114] G. Salton and C. Buckley, ‘‘Term-weighting Acoust. Speech Signal Process., Toulouse, Lang., vol. 22, pp. 17–38, 2008.
approaches in automatic text retrieval,’’ France, 2006, pp. 205–208. [141] A. Waibel, P. Geutner, L. M. Tomokiyo,
Inf. Process. Manage., vol. 24, no. 5, [128] R. Tong, B. Ma, H. Li, and E. Chng, T. Schultz, and M. Woszczyna,
pp. 513–523, 1988. ‘‘A target-oriented phonotactic front-end ‘‘Multilinguality in speech and spoken
[115] T. Schultz, I. Rogina, and A. Waibel, for spoken language recognition,’’ IEEE language systems,’’ Proc. IEEE, vol. 88, no. 8,
‘‘LVCSR-based language identification,’’ in Trans. Audio Speech Lang. Process., pp. 1181–1190, Aug. 2000.
Proc. IEEE Int. Conf. Acoust. Speech Signal vol. 17, no. 7, pp. 1335–1347, [142] P. C. Woodland and D. Povey, ‘‘Large scale
Process., Atlanta, GA, USA, 1996, vol. 2, Sep. 2009. discriminative training of hidden Markov
pp. 781–784. [129] R. Tong, B. Ma, H. Li, and E. Chng, models for speech recognition,’’ Comput.
[116] T. Schultz and A. Waibel, ‘‘Language ‘‘Target-aware lattice rescoring for dialect Speech Lang., vol. 16, no. 1, pp. 25–47,
independent and language adaptive,’’ Speech recognition,’’ in Proc. Interspeech Conf., 2002.
Commun., vol. 35, no. 1–2, pp. 31–51, Florence, Italy, 2011, pp. 733–736. [143] Y. Yan and E. Barnard, ‘‘An approach to
2001. [130] P. A. Torres-Carrasquillo, D. A. Reynolds, automatic language identification based on
[117] T. Schultz, ‘‘Globalphone: A multilingual and R. J. Deller, Jr., ‘‘Language identification language-dependent phone recognition,’’ in
text and speech database developed at using Gaussian mixture model tokenization,’’ Proc. IEEE Int. Conf. Acoust. Speech Signal
Karlsruhe University,’’ in Proc. Interspeech in Proc. IEEE Int. Conf. Acoust. Speech Process., Detroit, MI, USA, 1995, vol. 5,
Conf., Denver, CO, USA, 2002, pp. 345–348. Signal Process., Orlando, FL, USA, 2002, pp. 3511–3514.
pp. 757–760. [144] Y. Yan, E. Barnard, and R. Cole,
[118] W. Shen, W. Campbell, T. Gleason,
D. Reynolds, and E. Singer, ‘‘Experiments [131] P. Torres-Carrasquillo, E. Singer, M. Kohler, ‘‘Development of an approach to language
with lattice-based PPRLM language R. Greene, D. Reynolds, and J. Deller, Jr., identification based on phone recognition,’’
identification,’’ in Proc. IEEE Odyssey: ‘‘Approaches to language identification using Comput. Speech Lang., vol. 10, pp. 37–54,
Speaker Lang. Recognit. Workshop, San Juan, Gaussian mixture models and shifted delta 1996.
Puerto Rico, 2006, DOI: 10.1109/ODYSSEY. cepstral features,’’ in Proc. Int. Conf. Spoken [145] C. H. You, K. A. Lee, and H. Li, ‘‘GMM-SVM
2006.248100. Lang. Process., Denver, CO, USA, 2002, kernel with a Bhattacharyya-based distance
pp. 89–92. for speaker recognition,’’ IEEE Trans. Audio
[119] W. Shen and D. A. Reynolds, ‘‘Improved
GMM-Based language recognition using [132] P. Torres-Carrasquillo, E. Singer, Speech Lang. Process., vol. 18, no. 6,
constrained MLLR transforms,’’ in W. Campbell, T. Gleason, A. McCree, pp. 1300–1312, Aug. 2010.
Proc. Int. Conf. Acoust. Speech Signal D. Reynolds, F. Richardson, W. Shen, and [146] C. H. You, H. Li, and K. A. Lee, ‘‘A
Process., Las Vegas, NV, USA, 2008, D. Sturim, ‘‘The MITLL NIST LRE 2007 GMM-supervector approach to language
pp. 4149–4152. language recognition system,’’ in Proc. recognition with adaptive relevance factor,’’
Interspeech Conf., Brisbane, Australia, in Proc. EUSIPCO, Aalborg, Denmark, 2010,
[120] K. C. Sim and H. Li, ‘‘On acoustic
2008, pp. 719–722. pp. 1993–1997.
diversification front-end for spoken
language identification,’’ IEEE [133] R. C. Tucker, M. J. Carey, and E. S. Paris, [147] J. Zhao, H. Shu, L. Zhang, X. Wang, Q. Gong,
Trans. Audio Speech Lang. Process., ‘‘Automatic language identification using and P. Li, ‘‘Cortical competition during
vol. 16, no. 5, pp. 1029–1037, sub-words models,’’ in Proc. IEEE Int. Conf. language discrimination,’’ NeuroImage,
Jul. 2008. Acoust. Speech Signal Process., Adelaide, vol. 43, pp. 624–633, 2008.
Australia, 1994, pp. 301–304.
[121] E. Singer, P. Torres-Carrasquillo, [148] D. Zhu, B. Ma, and H. Li, ‘‘Soft margin
D. Reynolds, A. McCree, F. Richardson, [134] D. A. van Leeuwen and N. Brummer, estimation of Gaussian mixture model
N. Dehak, and D. Sturim, ‘‘The MITLL ‘‘Channel-dependent GMM and multi-class parameters for spoken language
NIST LRE2011 language recognition logistic regression models for language recognition,’’ in Proc. IEEE Int. Conf.
system,’’ in Proc. Odyssey: Speaker Lang. recognition,’’ in Proc. IEEE Odyssey: Acoust. Speech Signal Process., Dallas,
Recognit. Workshop, Singapore, 2012, Speaker Lang. Recognit. Workshop, San Juan, TX, USA, 2010, pp. 4990–4993.
pp. 209–215. Puerto Rico, 2006, DOI: 10.1109/ODYSSEY.
[149] M. A. Zissman, ‘‘Automatic language
2006.248094.
[122] S. M. Siniscalchi, J. Reed, T. Svendsen, and identification using Gaussian mixture
C.-H. Lee, ‘‘Exploring universal attribute [135] D. A. van Leeuwen and N. Brümmer, and hidden Markov models,’’ in Proc.
characterization of spoken languages for ‘‘An introduction to application independent IEEE Int. Conf. Acoust. Speech Signal
spoken language recognition,’’ in Proc. evaluation of speaker recognition systems,’’ Process., Minneapolis, MN, USA,
Interspeech Conf., Brighton, U.K., 2009, in Speaker Classification, vol. 4343, 1993, vol. 2, pp. 399–402.
pp. 168–171. R. Müller, Ed. Berlin, Germany:
[150] M. A. Zissman, ‘‘Comparison of four
Springer-Verlag, 2007.
[123] S. M. Siniscalchi, J. Reed, T. Svendsen, and approaches to automatic language
C.-H. Lee, ‘‘Exploiting context-dependency [136] D. A. van Leeuwen and K. P. Truong, identification of telephone speech,’’
and acoustic resolution of universal speech ‘‘An open-set detection evaluation IEEE Trans. Speech Audio Process.,
attribute models in spoken language methodology applied to language and vol. 4, no. 1, pp. 31–44, Jan. 1996.
recognition,’’ in Proc. Interspeech Conf., emotion recognition,’’ in Proc. Interspeech
[151] M. A. Zissman, ‘‘Predicting, diagnosing
Chiba, Japan, 2010, pp. 2718–2721. Conf., Antwerp, Belgium, 2007, pp. 338–341.
and improving automatic language
[124] A. Solomonoff, W. M. Campbell, and [137] D. A. van Leeuwen, M. Boer, and R. Orr, identification performance,’’ in Proc.
I. Boardman, ‘‘Advances in channel ‘‘A human benchmark for the NIST language Eurospeech Conf., Rhodes, Greece, 1997,
compensation for SVM speaker recognition,’’ recognition evaluation 2005,’’ presented at pp. 51–54.
in Proc. IEEE Int. Conf. Acoust. Speech Signal the Odyssey: Speaker Lang. Recognit.