MULTI-STREAM LANGUAGE IDENTIFICATION USING DATA-DRIVEN DEPENDENCY SELECTION
Sonia Parandekar, Katrin Kirchhoff
{sonia,katrin}@ee.washington.edu
Department of Electrical Engineering
University of Washington, Seattle, USA
ABSTRACT

The most widespread approach to automatic language identification in the past has been the statistical modeling of phone sequences extracted from speech signals. Recently, we have developed an alternative approach to LID based on n-gram modeling of parallel streams of articulatory features, which was shown to have advantages over phone-based systems on short test signals, whereas the latter achieved a higher accuracy on longer signals. Additionally, phone and feature streams can be combined to achieve maximum performance. Within this "multi-stream" framework two types of statistical dependencies need to be modeled: (a) dependencies between symbols in individual streams and (b) dependencies between symbols in different streams. The space of possible dependencies is typically too large to be searched exhaustively. In this paper, we explore the use of genetic algorithms as a method for data-driven dependency selection. The result is a general framework for the discovery and modeling of dependencies between multiple information sources expressed as sequences of symbols, which has implications for other fields beyond language identification, such as speaker identification or language modeling.

1. INTRODUCTION: N-GRAM MODEL APPROACHES TO LANGUAGE IDENTIFICATION

Automatic language identification (LID) continues to be of considerable importance for multilingual speech applications. Several approaches to LID have been developed in the past which make use of acoustic, prosodic, phonetic-phonotactic or lexical information. Of these, the phonotactic approach (e.g. [1]) has emerged as the most widespread and flexible technique. This approach assumes that language-discriminating information is encoded in the statistical regularities of phone sequences in different languages. As a first step, the speech signal is mapped to a sequence of phone symbols, \Phi = \phi_1, \phi_2, ..., \phi_N, using acoustic models such as Hidden Markov Models (HMMs). Statistical n-gram models are then trained on the resulting phone labels. An n-gram model specifies a set of probability distributions of a phone given a context of n - 1 phones and the language L:

P(\phi_1, \phi_2, ..., \phi_N | L) = \prod_{i=n}^{N} P(\phi_i | \phi_{i-1}, ..., \phi_{i-n+1}, L)   (1)

During language identification, the phone sequence derived from the test speech signal is scored against each of the language-specific n-gram models. The language of the n-gram model for which the highest score is obtained is then hypothesized as the true language (L*):

L* = argmax_L P(\phi_1, \phi_2, ..., \phi_N | L)   (2)

In our recently developed feature-based approach [2, 3], articulatory feature sequences are used in place of phone sequences. These features characterize different articulatory properties of the speech signal and are arranged into five separate groups (manner of articulation, consonantal place of articulation, vowel place of articulation, front-back tongue position, and lip rounding). Acoustic models are built for each feature value, analogous to acoustic phone models. Using these models, parallel streams of feature sequences, one for each feature group, are derived from a given speech signal. Language-specific n-gram models are trained for each feature stream. Analogous to the n-gram probability of a phone sequence, the probability of a feature sequence in a particular feature stream F = f_1, ..., f_N given a language L, P(F | L), is defined as:

P(F | L) = \prod_{i=n}^{N} P(f_i | f_{i-1}, ..., f_{i-n+1}, L)   (3)

The probability of an ensemble of K feature streams F_1, ..., F_K given language L, P(F_1, ..., F_K | L), is defined as:

P(F_1, ..., F_K | L) = C(P(F_1 | L), ..., P(F_K | L))   (4)

where C is some combination function, e.g. the product rule:

P(F_1, ..., F_K | L) = \prod_{k=1}^{K} P(F_k | L)   (5)

In our previous work we showed that both feature-based and phone-based approaches achieved comparable performance overall, but that the feature-based system obtained a significantly higher performance on very short test signals (<= 3 sec.) whereas the phone-based system achieved a higher accuracy on longer test signals. Due to the complementary nature of the two approaches, they can be combined to achieve maximum performance. A seamless way of integrating the phone- and feature-based systems is to treat the stream of phones as an additional stream within the set of articulatory feature streams. The equation for language classification would now become:

L* = argmax_L P(F_1, ..., F_K, \Phi | L)   (6)

Naturally, this can be extended to using multiple phone sequences, as is standard in some phone-based LID systems. In all "multi-stream" models of this type, two sets of dependencies need to be modeled: (a) dependencies within individual streams, and (b) dependencies across different streams. The model in Equation 5 assumes that all streams are independent given the language and thus ignores dependencies of type (b), which is clearly an oversimplification. The main objective of this study is to explore how statistical dependencies between different information streams can be detected and modeled more adequately.
dependent variable such that no circular dependencies are possible. The individual streams are arranged in an ordered tuple, for example: <j, k, l, m>. Any feature variable can be conditioned only on other feature variables in its own stream and those preceding it in the tuple. Though this effectively implies that the GA search will now be exploring a more restricted search space, different orderings of the set of features can be evaluated to get the optimal set of dependencies. Since the final goal is to improve on language identification accuracy, we use LID performance on a held-out development set as the fitness function.

Set        Phone    Feature   Combined
#params    2.35M    30K       2.38M
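The tuple-ordering constraint above can be made concrete with a small sketch. The stream names, the single-step time lag, and the `allowed_parents` helper are hypothetical illustrations, not the paper's actual implementation:

```python
def allowed_parents(order, lags=(0, 1)):
    """Candidate conditioning variables per stream under the tuple-ordering
    constraint: a variable may depend on earlier symbols of its own stream
    and on streams preceding it in the tuple, so no circular (mutual,
    same-time) dependencies can arise."""
    allowed = {}
    for i, stream in enumerate(order):
        parents = [(stream, lag) for lag in lags if lag > 0]      # own stream, past only
        parents += [(t, lag) for t in order[:i] for lag in lags]  # earlier streams, any lag
        allowed[stream] = parents
    return allowed

order = ("j", "k", "l", "m")   # the example tuple from the text
deps = allowed_parents(order)
# The first stream may only look at its own past; the last may also
# condition on j, k, and l, including at the current time step.
print(len(deps["j"]), len(deps["m"]))   # -> 1 7
```

A GA over this restricted space would then encode each candidate dependency set as a bit string over these (stream, lag) pairs and use held-out LID accuracy as the fitness function, as described above.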
Table 3. LID accuracy (%), number of system parameters, and number of selected cross-stream dependencies for greedy dependency selection.

Table 5. LID accuracy (%) for duration-specific n-grams plus duration-specific GA search vs. standard GA search.
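The GA search referred to above can be sketched as a minimal genetic algorithm over dependency subsets. Everything here is a toy stand-in: each genome is a bit list marking which candidate cross-stream dependencies are modeled, and the stand-in fitness function replaces the held-out LID accuracy used in the actual experiments:

```python
import random

def ga_select(num_deps, fitness, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Minimal GA over dependency subsets: truncation selection,
    one-point crossover, and per-bit mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(num_deps)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]                 # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, num_deps)             # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (rng.random() < p_mut) for g in child]  # bit-flip mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

def toy_fitness(genome, target=(0, 3)):
    """Stand-in for dev-set LID accuracy: pretend only candidate
    dependencies 0 and 3 actually help."""
    return sum(1 for i, bit in enumerate(genome) if (i in target) == bool(bit))

best = ga_select(6, toy_fitness)
```

In the real system, evaluating one genome means retraining the multi-stream models with the selected dependencies and scoring the development set, which is why the restricted search space from the tuple ordering matters.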