
MULTI-STREAM LANGUAGE IDENTIFICATION USING DATA-DRIVEN DEPENDENCY SELECTION
Sonia Parandekar, Katrin Kirchhoff
{sonia,katrin}@ee.washington.edu
Department of Electrical Engineering
University of Washington, Seattle, USA

ABSTRACT

The most widespread approach to automatic language identification in the past has been the statistical modeling of phone sequences extracted from speech signals. Recently, we have developed an alternative approach to LID based on n-gram modeling of parallel streams of articulatory features, which was shown to have advantages over phone-based systems on short test signals, whereas the latter achieved a higher accuracy on longer signals. Additionally, phone and feature streams can be combined to achieve maximum performance. Within this "multi-stream" framework two types of statistical dependencies need to be modeled: (a) dependencies between symbols in individual streams and (b) dependencies between symbols in different streams. The space of possible dependencies is typically too large to be searched exhaustively. In this paper, we explore the use of genetic algorithms as a method for data-driven dependency selection. The result is a general framework for the discovery and modeling of dependencies between multiple information sources expressed as sequences of symbols, which has implications for other fields beyond language identification, such as speaker identification or language modeling.

1. INTRODUCTION: N-GRAM MODEL APPROACHES TO LANGUAGE IDENTIFICATION

Automatic language identification (LID) continues to be of considerable importance for multilingual speech applications. Several approaches to LID have been developed in the past which make use of acoustic, prosodic, phonetic-phonotactic or lexical information. Of these, the phonotactic approach (e.g. [1]) has emerged as the most widespread and flexible technique. This approach assumes that language-discriminating information is encoded in the statistical regularities of phone sequences in different languages. As a first step, the speech signal is mapped to a sequence of phone symbols, \Phi = \phi_1, \phi_2, ..., \phi_N, using acoustic models such as Hidden Markov Models (HMMs). Statistical n-gram models are then trained on the resulting phone labels. An n-gram model specifies a set of probability distributions of a phone given a context of n-1 phones and the language L:

    P(\phi_1, \phi_2, ..., \phi_N | L) = \prod_{i=n}^{N} P(\phi_i | \phi_{i-1}, ..., \phi_{i-n+1}, L)    (1)

During language identification, the phone sequence derived from the test speech signal is scored against each of the language-specific n-gram models. The language of the n-gram model for which the highest score is obtained is then hypothesized as the true language (L^*):

    L^* = \arg\max_L P(\phi_1, \phi_2, ..., \phi_N | L)    (2)

In our recently developed feature-based approach [2, 3], articulatory feature sequences are used in place of phone sequences. These features characterize different articulatory properties of the speech signal and are arranged into five separate groups (manner of articulation, consonantal place of articulation, vowel place of articulation, front-back tongue position and lip rounding). Acoustic models are built for each feature value, analogous to acoustic phone models. Using these models, parallel streams of feature sequences, one for each feature group, are derived from a given speech signal. Language-specific n-gram models are trained for each feature stream. Analogous to the n-gram probability of a phone sequence, the probability of a feature sequence F = f_1, ..., f_N in a particular feature stream given a language L, P(F | L), is defined as:

    P(F | L) = \prod_{i=n}^{N} P(f_i | f_{i-1}, ..., f_{i-n+1}, L)    (3)

The probability of an ensemble of K feature streams F_1, ..., F_K given language L, P(F_1, ..., F_K | L), is defined as:

    P(F_1, ..., F_K | L) = C(P(F_1 | L), ..., P(F_K | L))    (4)

where C is some combination function, e.g. the product rule:

    P(F_1, ..., F_K | L) = \prod_{k=1}^{K} P(F_k | L)    (5)

In our previous work we showed that both feature-based and phone-based approaches achieved comparable performance overall, but that the feature-based system obtained a significantly higher performance on very short test signals (<= 3 sec.) whereas the phone-based system achieved a higher accuracy on longer test signals. Due to the complementary nature of the two approaches, they can be combined to achieve maximum performance. A seamless way of integrating the phone and feature-based systems is to treat the stream of phones as an additional stream within the set of articulatory feature streams. The equation for language classification would now become:

    L^* = \arg\max_L P(F_1, ..., F_K, \Phi | L)    (6)

Naturally, this can be extended to using multiple phone sequences, as is standard in some phone-based LID systems. In all "multi-stream" models of this type, two sets of dependencies need to be modeled: (a) dependencies within individual streams, and (b) dependencies across different streams. The model in Equation 5 assumes that all streams are independent given the language and thus ignores dependencies of type (b), which is clearly an oversimplification. The main objective of this study is to explore how statistical dependencies between different information streams can be detected and modeled more adequately.
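As an illustration of Equations 1, 2 and 5, the scoring procedure can be sketched as follows. The nested-dictionary model format, the probability floor and all names here are our illustrative assumptions; the actual systems in this paper use properly smoothed n-gram models.

```python
import math

def sequence_log_prob(symbols, trigrams, n=3):
    """Log form of Eq. 1: sum of log P(symbol_i | n-1 predecessors, L).
    `trigrams` is a hypothetical table {(prev2, prev1): {symbol: prob}}."""
    logp = 0.0
    for i in range(n - 1, len(symbols)):
        context = tuple(symbols[i - n + 1:i])
        # Crude floor for unseen events; a real system would use a
        # smoothed model instead.
        p = trigrams.get(context, {}).get(symbols[i], 1e-12)
        logp += math.log(p)
    return logp

def identify_language(streams, models):
    """Eq. 2 combined with the product rule of Eq. 5: per-stream
    log-probabilities are summed and the best-scoring language wins."""
    scores = {
        lang: sum(sequence_log_prob(seq, stream_models[name])
                  for name, seq in streams.items())
        for lang, stream_models in models.items()
    }
    return max(scores, key=scores.get)
```

With a single phone stream this reduces to Equation 2; adding feature streams to `streams` gives the multi-stream combination of Equation 5 in log space.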

0-7803-7663-3/03/$17.00 ©2003 IEEE          I-28          ICASSP 2003


2. MODELING CROSS-STREAM DEPENDENCIES

In our work, the phone stream is treated as an additional (K+1)th feature stream. The baseline phone and feature systems described above model the probabilities of a symbol at a given time t conditioned on symbols at previous time positions within the same stream. Possible dependencies on variables in other streams are not taken into account. However, the different streams are extracted from the same speech signal, and conditioning a symbol on variables in other streams might therefore yield additional gains. A cross-stream model can be represented more formally as:

    P(f_t^j) = P(f_t^j | f_{t-1}^j, f_{t-2}^j, ..., f_{t-n+1}^j, F \setminus \{j\})    (7)

where F \setminus \{j\} represents some subset of the set of all features minus those in stream j. In general, for a fixed context length of n and K streams, there are nK - 1 conditioning features, viz. the n-1 previous features in the same stream and all n features in the current context window in the K-1 remaining streams. To find the optimal combination of some or all of these conditioning variables, we in principle need to conduct an exhaustive search over all possible subsets. The number of possible subsets, \sum_{i=1}^{j} \binom{j}{i} where j = nK - 1, is prohibitively large and cannot be searched exhaustively.

For models with cross-stream dependencies, the conditioning and the dependent streams need to be aligned to determine joint frequencies. A simultaneous alignment of all streams at the frame level typically leads to multiple repetitions of the same symbol within a stream and across sub-groups of streams. We noticed in previous experiments that such repetitions decrease accuracy significantly, as they tend to dominate the n-gram scores. To overcome this problem, we separately align the sets of conditioning and dependent variables corresponding to different cross-stream dependencies and, within each set, only use those vectors of variable values for scoring where at least one value changes. The scores for the different stream groupings, normalized by the number of vectors considered, are then combined using Equation 5. Weighted combination using another classifier, e.g. a multi-layer perceptron, was not shown to yield any advantages beyond product combination in the past.

In our previous work [2] pair-wise cross-stream dependencies in a purely feature-based system were identified using a greedy search technique. Their integration yielded improvements in LID accuracy; however, only a small subset of all possible dependencies was explored. In this study we use a more powerful search technique, viz. Genetic Algorithms (GA), as described in the next section.
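The change-based filtering described above can be sketched as follows; representing the frame-aligned streams as equal-length lists of symbols is our illustrative assumption:

```python
def change_filtered_vectors(streams):
    """Frame-aligned symbol streams (a list of equal-length lists) are
    reduced to the sub-sequence of joint vectors where at least one
    stream's value changes relative to the previous frame, so that
    repeated identical vectors cannot dominate the n-gram counts."""
    vectors = list(zip(*streams))
    kept = vectors[:1]
    for v in vectors[1:]:
        if v != kept[-1]:
            kept.append(v)
    return kept
```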
3. GENETIC ALGORITHMS FOR DEPENDENCY SELECTION

Genetic Algorithms are a general search and optimization technique inspired by natural evolutionary processes. Some of the distinguishing characteristics of GAs are the following:

1. GAs do not deal directly with problem parameters themselves but with binary string encodings of individual problem solutions. Different solutions are created by applying genetic operators modifying these strings, as described below. Since these operators are applied probabilistically, the search is guided towards various unexplored parts of the search space that might potentially contain better solutions. In contrast, deterministic techniques always search within a fixed pre-defined area of the search space.

2. GAs work with a population of potential solutions, i.e. sets of parameter values (in the form of encoded strings), rather than a single solution. GAs thus simultaneously explore several sections of the search space for a particular problem, which often prevents them from converging on a local optimum.

3. GAs employ a user-defined fitness function to determine the goodness of each solution or the entire population. At each iteration of the genetic search, the current population is evaluated and modified to move to a higher fitness value.

GA operators are simple, well-defined functions that are applied to members of the population. The most basic operators employed by GAs are reproduction (or selection), crossover, and mutation. Reproduction is a process by which individual strings are copied into a pool from which strings for the next iteration are selected. Strings with a higher fitness value have a greater probability of contributing to this pool. During crossover, pairs of members from the pool are selected and new strings are produced by swapping characters from the original pair at randomly selected positions. Mutation is the occasional random alteration of the value of a bit or digit in the string. Reproduction and crossover try to preserve good partial solutions from one iteration to the next. Mutation is of secondary importance compared to reproduction and crossover and is used mainly to maintain the heterogeneity of the population and to prevent premature convergence. GA search involves the following steps:

- encode the problem parameters (i.e. decision variables) as binary strings;
- determine an appropriate fitness function;
- randomly generate an initial population of strings;
- while the termination criterion has not been reached:
  - evaluate the fitness of each individual in the population and select a pool of solutions for the next generation;
  - apply crossover and mutation.

The operators work on successive generations of solutions, with each generation producing more and more refined solutions. The algorithm stops when some termination criterion (usually a specific value of the fitness function, or of its change over several generations) is satisfied. Several alternatives are available for the specific implementation of the GA operators. In our work we investigated different implementations, namely roulette wheel, stochastic universal sampling and tournament selection, and one-point, two-point and uniform crossover operators. The choice of operator implementation had an impact on the efficiency of the search but rarely on the final outcome, and we settled upon tournament selection and uniform crossover for most of our experiments. The GA can be made more powerful by integrating advanced operators and techniques. One such technique that we have adopted is an elitist model, which ensures that the best individual of a generation is preserved in the next generation.

In order to apply GA-based search to the problem of dependency selection, the set of conditioning features for any given dependent feature variable needs to be encoded in a string. Consider a feature variable f_t^m with a potential set of conditioning variables defined by:

    { f_{t-2}^m, f_{t-1}^m, f_{t-2}^k, f_{t-1}^k, f_t^k, f_{t-2}^l, f_{t-1}^l, f_t^l }

where the context length of interest is 3. This conditioning set can be represented by an eight-bit binary string, with 1 representing presence and 0 representing absence of the dependency. For example, 10111001 would imply that f_t^m is conditioned on:

    { f_{t-2}^m, f_{t-2}^k, f_{t-1}^k, f_t^k, f_t^l }
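This decoding from bit strings to conditioning sets can be sketched as follows; the (stream, time offset) pair representation and the function name are our illustrative assumptions:

```python
def decode_conditioning_mask(mask, candidates):
    """candidates[i] is kept as a conditioning variable iff bit i of the
    binary dependency string is '1'."""
    assert len(mask) == len(candidates)
    return [c for bit, c in zip(mask, candidates) if bit == "1"]
```

With the candidate list ordered as (m, t-2), (m, t-1), (k, t-2), (k, t-1), (k, t), (l, t-2), (l, t-1), (l, t), the string 10111001 selects exactly the five conditioning variables of the worked example.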

It is important to prevent circular dependencies in order to obtain valid probability distributions. Circularity occurs, for instance, when f_t^j is conditioned on f_t^k and vice versa. One method of overcoming this problem is to restrict the potential conditioning set for each dependent variable such that no circular dependencies are possible. The individual streams are arranged in an ordered tuple, for example <j, k, l, m>. Any feature variable can be conditioned only on other feature variables in its own stream and those preceding it in the tuple. Though this effectively implies that the GA search will now be exploring a more restricted search space, different orderings of the set of features can be evaluated to get the optimal set of dependencies. Since the final goal is to improve language identification accuracy, we use LID performance on a held-out development set as the fitness function.
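The search loop described in this section can be sketched as follows. The operator choices mirror the ones above (tournament selection, uniform crossover, bit-flip mutation, elitism), but all parameter values are illustrative, and any fitness function over bit strings, such as dev-set LID accuracy, can be plugged in:

```python
import random

def genetic_search(fitness, n_bits, pop_size=40, generations=60,
                   p_cross=0.9, p_mut=0.02, tour_k=3, seed=0):
    """Minimal elitist GA over fixed-length bit strings."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        def select():  # tournament selection
            return max(rng.sample(pop, tour_k), key=fitness)
        nxt = [best[:]]  # elitist model: the best string always survives
        while len(nxt) < pop_size:
            a, b = select(), select()
            if rng.random() < p_cross:  # uniform crossover
                a = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            a = [bit ^ (rng.random() < p_mut) for bit in a]  # bit-flip mutation
            nxt.append(a)
        pop = nxt
        best = max(pop, key=fitness)  # the elite copy keeps this monotone
    return best
```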

4. CORPUS AND BASELINE SYSTEMS

Experiments reported in this paper are based on the OGI-TS corpus [4] of telephone speech data from 10 different languages (English, Farsi, French, German, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese). We follow the same division of the corpus into training, development and evaluation sets as defined in [5], which is identical to the definitions included in the LDC distribution. These sets contain 4650, 1898 and 1848 utterances for training, development and evaluation, respectively. Each of the three sets contains approximately the same number of utterances per language. It is important to note that unlike many previous studies, which have excluded speech signals shorter than 10 seconds or only used a subset of the 10 languages, our results are based on a 10-way forced choice including all languages and signal files of all lengths.
The first step in our baseline LID system is speech/non-speech segmentation, which is performed by a neural network trained on a hand-labeled subset of the training data, followed by temporal smoothing of the network outputs. The speech segments are then converted to 12 mel-frequency cepstral coefficients, log energy, and first-order temporal derivatives, yielding 26-dimensional feature vectors. Based on this acoustic representation, Hidden Markov Models (HMMs) with 3 states and 2 Gaussian mixture components each are trained for 26 features grouped into the following five streams: manner of articulation (mann), consonantal place of articulation (cpl), vowel place of articulation (vpl), lip rounding (rd) and front-back position of the tongue (fb). Furthermore, 8 models are used for silence and various types of background noises. Individual noise models are trained for each feature group. The total number of models in the feature-based system is 66. The trained acoustic models are then used to generate feature labels by unconstrained recognition. For each stream, trigram models with Witten-Bell smoothing are trained using the SRILM toolkit [6] with an extension module for multi-stream n-gram models implemented by Jeff Bilmes (U. Washington). Explicit duration modeling was incorporated into the n-gram models by relabeling feature labels in accordance with their temporal duration: for each feature, a duration histogram was estimated, and all labels above the 25th percentile of the distribution were relabeled as long features; others were relabeled as short features. This split, as well as the final selection of split labels to be incorporated into the n-gram modeling component, was optimized on the development set. This leads to a total of 89 feature models, arranged in sets of 19, 21, 21, 15 and 13 models, respectively, for the articulatory feature streams mentioned above. The phone system is trained on the same acoustic representation and contains 133 HMMs with 3 states and 4 Gaussian mixture components each. Phone trigrams are used.
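The duration-based relabeling can be sketched as follows, assuming the label sequence has been run-length encoded into (label, duration) pairs; the percentile estimate and the _long/_short suffixes are illustrative stand-ins for the optimized split described above:

```python
def duration_split_labels(label_runs):
    """For each label, estimate a duration cutoff from that label's own
    duration distribution (roughly the 25th percentile) and relabel runs
    above it as long, the rest as short."""
    durations = {}
    for lab, dur in label_runs:
        durations.setdefault(lab, []).append(dur)
    cutoff = {lab: sorted(d)[len(d) // 4] for lab, d in durations.items()}
    return [f"{lab}_long" if dur > cutoff[lab] else f"{lab}_short"
            for lab, dur in label_runs]
```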
5. EXPERIMENTS AND RESULTS

Table 1 shows the baseline error rates for the phone, feature and combined systems. System combination was done as specified in Equation 5.

Table 1. Baseline LID accuracy (%) and number of system parameters for the phone-based, feature-based and combined systems. [Most table entries are illegible in the source; the #params row reads 2.35M (phone), 30K (feature) and 2.38M (combined).]

In our previous work we concentrated on improving the feature-based approach by applying techniques such as explicit duration modeling for features, as explained in Section 4. These techniques were not applied to the phone-based system, which is why the baseline feature-based system shows a much better performance. However, due to the complementary nature of the two approaches, their combination still yields significant improvements in LID accuracy (significant at the 0.0002 and 0.002 level, respectively, using a difference-of-proportions significance test). It should be noted that the number of parameters required for the phone trigrams, i.e. 133^3, is much larger than the number of parameters required for all the feature trigrams combined, i.e. 19^3 + 21^3 + 21^3 + 15^3 + 13^3.

Our next step was to incorporate explicit cross-stream dependencies, with the goal of not only obtaining the maximum LID accuracy but also of minimizing the number of parameters needed. The entire set of possible dependencies included within-stream feature and phone dependencies (i.e. standard n-grams), cross-stream dependencies among feature streams, as well as cross-stream dependencies between individual feature streams and the phone stream. N-gram orders of up to n = 3 were considered. We applied GA-based search to the following subsets of dependency models:

- within-stream + cross-stream dependencies for features only (Group A), i.e. standard feature n-grams plus allowing conditioning features in streams other than the current one;
- as Group A, but additionally allowing phones to be conditioned on features or vice versa (Group B);
- as Group B, but additionally including standard phone n-grams (Group C);
- standard phone n-grams plus allowing phones to be conditioned on features or vice versa; no feature n-grams (Group D).

Results are shown in Table 2.

Table 2. LID accuracy (%), number of system parameters, and number of selected cross-stream dependencies. [Table body illegible in the source.]

We made the following observations: first, the accuracy on the development set always increases significantly, which is not surprising since the fitness function was directly designed to maximize accuracy on this set. On the evaluation set, significant gains (p = 0.05) were obtained for Groups B and D, compared to the baseline systems. Second, cross-stream dependencies were selected for Groups A and B, which do not include phone n-grams among the set of possible models, but not in C, which does include phone trigrams. It seems that phone n-grams and feature n-grams plus cross-stream dependencies are two different ways of modeling similar information. These two approaches are associated with different accuracy-cost tradeoffs: whereas the inclusion of phone n-grams leads to the best accuracy overall, it comes at a significant cost in terms of the number of parameters (see Table 2, column 3).
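As a quick arithmetic check of the parameter counts quoted above (a full trigram table grows with the cube of the symbol inventory size):

```python
# Full trigram table sizes: inventory size cubed.
phone_params = 133 ** 3                                  # 2,352,637 (~2.35M)
feature_params = 19**3 + 21**3 + 21**3 + 15**3 + 13**3   # 30,953 (~30K)
```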

The feature-based system, by contrast, has a lower accuracy but only a fraction of the parameters needed for the phone-based system. Finally, in the course of many experiments using GA-based search we noticed that it is impossible to predict which dependencies are selected under which conditions, indicating strong non-linear interactions between different conditioning variables. This emphasizes the advantages of using the Genetic Algorithm rather than heuristic or other conventional search techniques for dependency selection. As a comparison, we ran dependency selection experiments for the three groups using greedy search as described in [2]. The results (Table 3) show that GA-based search did indeed lead to better performance in most cases.

Table 3. LID accuracy (%), number of system parameters, and number of selected cross-stream dependencies for greedy dependency selection. [Table body illegible in the source.]

The OGI-TS corpus consists of spontaneous and non-spontaneous utterances ranging from a few seconds to a maximum of about one minute. Whereas short utterances include enumerations such as the days of the week, longer utterances are spontaneous narrations. Utterance duration is therefore related to speaking style and vocabulary effects. Better results might be obtained if both n-gram model training and GA search were applied to different duration-specific subsets of utterances separately. We sorted training utterances into different groups based on their length. Within these groups, the sets of development utterances were randomly sub-sampled to ensure that each language had roughly the same number of samples, which prevents language-specific biases in the GA search. For each category, within-stream and/or cross-stream dependency models trained on utterances in that category alone were considered for scoring if, on development data, they outperformed models trained on the entire training set. The results in Table 4 show that duration-specific n-grams are indeed beneficial for LID accuracy.

Table 4. LID accuracy (%) for phone-based, feature-based and combined LID systems using duration-specific n-grams. [Table body illegible in the source.]

Individual GA searches were then conducted separately for each durational category, using the same configuration as in Tables 2 and 3. Results from these experiments for different duration categories are shown in Table 5.

Table 5. LID accuracy (%) for duration-specific n-grams plus duration-specific GA search vs. standard GA search. [Table body illegible in the source.]

We can see that there is hardly any additional advantage to using dependency selection conditioned on utterance length; we even see a decrease in performance for Groups A and C. Since utterances are divided into sub-groups based on their lengths, there may not be enough data in the dev set of each group to obtain generalizable models, and the Genetic Algorithm may be over-training on the development set.

6. SUMMARY AND CONCLUSIONS

In this paper we have demonstrated the importance of incorporating cross-stream dependencies into multi-stream models for LID. Our conclusions are that (a) due to the complementary nature of the two approaches, combining phone and feature-based information streams leads to significant improvements in LID accuracy; (b) further significant improvements in accuracy can be obtained by explicitly modeling dependencies across these different streams; and (c) Genetic Algorithms outperform heuristic search for the purpose of selecting the best set of dependency models from the large space of all possible combinations. Applying GA-based search to different subclusters of utterances separately did not yield any significant improvement compared to GA search using all available data collectively. We also noticed the danger of over-training the GA to the development data, especially when cluster sizes are small. In the future, we intend to incorporate explicit penalization factors for model complexity into the fitness function. The approach presented here can be applied to a number of different scenarios, e.g. standard phonotactic LID systems that use multiple phone streams, and to recently developed systems for speaker identification based on similar approaches [7, 8]. Furthermore, the framework is general enough to be able to accommodate any other information source which can be expressed as sequences of discrete symbols, such as sequences of HMM state indices or prosodic symbols.

Acknowledgements

The work presented in this paper was funded by the US DoD.

7. REFERENCES

[1] M.A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Trans. Speech and Audio Processing, vol. 4(1), pp. 31-44, 1996.

[2] K. Kirchhoff and S. Parandekar, "Multi-stream statistical language modeling with application to automatic language identification," in Proceedings of Eurospeech-01, 2001, pp. 803-806.

[3] K. Kirchhoff, S. Parandekar, and J. Bilmes, "Mixed-memory ..." [remainder of entry truncated in the source].

[4] Y.K. Muthusamy et al., "The OGI multi-language telephone speech corpus," in Proceedings of ICSLP-92, 1992.

[5] Y.K. Muthusamy, A Segmental Approach to Automatic Language Identification, Ph.D. thesis, Oregon Graduate Institute, 1993.

[6] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proceedings of ICSLP, 2002.

[7] W. Andrews, M. Kohler, and J. Campbell, "Phonetic speaker recognition," in Proceedings of Eurospeech, 2001, pp. 2517-2520.

[8] Q. Jin, T. Schultz, and A. Waibel, "Speaker identification using multilingual phone strings," in Proceedings of ICASSP, 2002, pp. 145-148.

