A MACHINE LEARNING FRAMEWORK FOR SPOKEN-DIALOG CLASSIFICATION
Corinna Cortes¹
¹ Google Research, 76 Ninth Avenue, New York, NY 10011
corinna@google.com

Patrick Haffner²
² AT&T Labs – Research, 180 Park Avenue, Florham Park, NJ 07932
haffner@research.att.com

Mehryar Mohri³,¹
³ Courant Institute, 251 Mercer Street, New York, NY 10012
mohri@cims.nyu.edu
ABSTRACT
One of the key tasks in the design of large-scale dialog systems is classification. This consists of assigning, out of a finite set, a specific category to each spoken utterance, based on the output of a speech recognizer. Classification in general is a standard machine learning problem, but the objects to classify in this particular case are word lattices, or weighted automata, and not the fixed-size vectors learning algorithms were originally designed for. This chapter presents a general kernel-based learning framework for the design of classification algorithms for weighted automata. It introduces a family of kernels, rational kernels, that combined with support vector machines form powerful techniques for spoken-dialog classification and other classification tasks in text and speech processing. It describes efficient algorithms for their computation and reports the results of their use in several difficult spoken-dialog classification tasks based on deployed systems. Our results show that rational kernels are easy to design and implement and lead to substantial improvements of the classification accuracy. The chapter also provides some theoretical results helpful for the design of rational kernels.
1. MOTIVATION
A critical problem for the design of large-scale spoken-dialog systems is to assign a category, out of a finite set, to each spoken utterance. These categories help guide the dialog manager in formulating a response to the speaker. The choice of categories depends on the application; they could be, for example, "referral" or "pre-certification" for a health-care company dialog system, or "billing services" or "credit" for an operator-service system.

To determine the category of a spoken utterance, one needs to analyze the output of a speech recognizer. Figure 1 is taken from a customer-care application. It illustrates the output of a state-of-the-art speech recognizer in a very simple case where the spoken utterance is "Hi, this is my number". The output is an acyclic weighted automaton called a word lattice. It compactly represents the recognizer's best guesses. Each path is labeled with a sequence of words and has a score obtained by summing the weights of its constituent transitions. The path with the lowest score is the recognizer's best guess, in this case "I'd like my card number".

[Figure 1: Word lattice output of a speech recognition system for the spoken utterance "Hi, this is my number".]

This example makes evident that the error rate of conversational speech recognition systems is still too high in many tasks to rely only on the one-best output of the recognizer. Instead, one can use the full word lattice, which contains the correct transcription in most cases. This is indeed the case in Figure 1, since the top path is labeled with the correct sentence. Thus, in this chapter, spoken-dialog classification is formulated as the problem of assigning a category to each word lattice.
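To make this concrete, here is a minimal sketch of a word lattice represented as a list of weighted transitions, together with the path scoring just described. The states, words, and weights below are hypothetical stand-ins, not read off the actual lattice of Figure 1.

```python
def lattice_paths(arcs, state, final, words=(), score=0.0):
    """Yield (word sequence, score) for every path from state to final.

    The lattice is acyclic; a path's score is the sum of its transition
    weights, and the recognizer's best guess is the lowest-scoring path.
    """
    if state == final:
        yield words, score
    for src, dst, word, weight in arcs:
        if src == state:
            yield from lattice_paths(arcs, dst, final,
                                     words + (word,), score + weight)

# Hypothetical transitions: (source state, destination state, word, weight).
arcs = [
    (0, 1, "hi", 80.8), (0, 1, "I'd", 43.3),
    (1, 2, "this", 41.6), (1, 2, "like", 16.9),
    (2, 3, "is", 22.4), (2, 3, "my", 19.2),
    (3, 4, "my", 63.1), (3, 4, "card", 20.2),
    (4, 5, "number", 34.6),
]

# The recognizer's best guess is the path with the lowest cumulative score.
best_words, best_score = min(lattice_paths(arcs, 0, 5), key=lambda p: p[1])
print(" ".join(best_words))  # best guess: "I'd like my card number"
```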
 
Classification in general is a standard machine learning problem. A classification algorithm receives a finite number of labeled examples which it uses for training, and selects a hypothesis expected to make few errors on future examples. For the design of modern spoken-dialog systems, this training sample is often available. It is the result of careful human labeling of spoken utterances with a finite number of pre-determined categories of the type already mentioned.

But most classification algorithms were originally designed to classify fixed-size vectors. The objects to analyze for spoken-dialog classification are word lattices, each a collection of a large number of sentences with some weight or probability. How can standard classification algorithms such as support vector machines [Cortes and Vapnik, 1995] be extended to handle such objects?

This chapter presents a general framework and solution for this problem, which is based on kernel methods [Boser et al., 1992, Schölkopf and Smola, 2002]. Thus, we shall start with a brief introduction to kernel methods (Section 2). Section 3 will then present a kernel framework, rational kernels, that is appropriate for word lattices and other weighted automata. Efficient algorithms for the computation of these kernels will be described in Section 4. We also report the results of our experiments using these methods in several difficult large-vocabulary spoken-dialog classification tasks based on deployed systems in Section 5. There are several theoretical results that can guide the design of kernels for spoken-dialog classification. These results are discussed in Section 6.
2. INTRODUCTION TO KERNEL METHODS
Let us start with a very simple two-group classification problem, illustrated by Figure 2, where one wishes to distinguish two populations, the blue and red circles. In this very simple example, one can choose a hyperplane to correctly separate the two populations. But there are infinitely many choices for the selection of that hyperplane. There is good theory, though, supporting the choice of the hyperplane that maximizes the margin, that is, the distance between each population and the separating hyperplane. Indeed, let $H$ denote the class of real-valued functions on the ball of radius $R$ in $\mathbb{R}^N$:

$$H = \{x \mapsto w \cdot x : \|w\| \leq 1, \|x\| \leq R\}. \qquad (1)$$

Then, it can be shown [Bartlett and Shawe-Taylor, 1999] that there is a constant $c$ such that, for all distributions $D$ over $X$, with probability at least $1 - \delta$, if a classifier $\mathrm{sgn}(h)$, with $h \in H$, has margin at least $\rho$ over $m$ independently generated training examples, then the generalization error of $\mathrm{sgn}(h)$, or error on any future example, is no more than

$$\frac{c}{m} \left( \frac{R^2}{\rho^2} \log^2 m + \log \frac{1}{\delta} \right). \qquad (2)$$

This bound justifies large-margin classification algorithms such as support vector machines (SVMs).
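As a rough numerical illustration of how bound (2) behaves: a sketch only, since the constant $c$ is not specified by the theorem, so $c = 1$ is used purely for illustration, and only the scaling in $R$, $\rho$, $m$, and $\delta$ is meaningful here.

```python
import math

# Right-hand side of bound (2) with the unspecified constant set to c = 1
# and natural logarithms assumed. Note what the bound depends on: R, rho,
# m, delta -- but not the dimension of the feature space.
def margin_bound(R, rho, m, delta, c=1.0):
    return (c / m) * ((R / rho) ** 2 * math.log(m) ** 2 + math.log(1 / delta))

print(margin_bound(R=1.0, rho=0.1, m=10_000, delta=0.05))
print(margin_bound(R=1.0, rho=0.2, m=10_000, delta=0.05))  # doubling rho tightens it
```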
Let $w \cdot x + b = 0$ be the equation of the hyperplane, where $w \in \mathbb{R}^N$ is a vector normal to the hyperplane and $b \in \mathbb{R}$ a scalar offset. The classifier $\mathrm{sgn}(h)$ corresponding to this hyperplane is unique and can be defined with respect to the training points $x_1, \ldots, x_m$:

$$h(x) = w \cdot x + b = \sum_{i=1}^{m} \alpha_i \, (x_i \cdot x) + b, \qquad (3)$$

where the $\alpha_i$'s are real-valued coefficients. The main point we are interested in here is that, both for the construction of the hypothesis and the later use of that hypothesis for the classification of new examples, one needs only to compute a number of dot products between examples.

[Figure 2: Large-margin linear classification. (a) An arbitrary hyperplane can be chosen to separate the two groups. (b) The maximal-margin hyperplane provides better theoretical guarantees.]

In practice, linear separation of the training data is often not possible. Figure 3(a) shows an example where any hyperplane crosses both populations. However, one can use more complex functions to separate the two sets, as in Figure 3(b). One way to do that is to use a non-linear mapping $\Phi: X \to F$ from the input space $X$ to a higher-dimensional space $F$ where linear separation is possible.

The dimension of $F$ can truly be very large in practice. For example, in the case of document classification, one may use as features sequences of three consecutive words (trigrams). Thus, with a vocabulary of just 100,000 words, the dimension of the feature space $F$ is $10^{15}$. On the positive side, as indicated by the error bound of Equation 2, the generalization ability of large-margin classifiers such as SVMs does not depend on the dimension of the feature space but only on the margin $\rho$ and the number of training examples $m$. However, taking a large number of dot products in a very high-dimensional space to define the hyperplane may be very costly.

A solution to this problem is to use the so-called "kernel trick" or kernel methods. The idea is to define a function $K: X \times X \to \mathbb{R}$, called a kernel, such that the kernel function on two examples $x$ and $y$ in input space, $K(x, y)$, is equal to the dot product of the two examples $\Phi(x)$ and $\Phi(y)$ in feature space:

$$\forall x, y \in X, \quad K(x, y) = \Phi(x) \cdot \Phi(y). \qquad (4)$$
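A small numerical check of Equation (4) for one classical kernel; this particular $K$ and feature map $\Phi$ are textbook illustrations chosen here for concreteness, not taken from the chapter. For $K(x, y) = (x \cdot y)^2$ on $\mathbb{R}^2$, the map $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$ satisfies $K(x, y) = \Phi(x) \cdot \Phi(y)$, so the kernel returns the feature-space dot product without ever building $\Phi$:

```python
import numpy as np

def K(x, y):
    # Degree-2 polynomial kernel (illustrative choice).
    return np.dot(x, y) ** 2

def Phi(x):
    # Explicit feature map for K above; never needed when using the kernel.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, y))                 # 1.0
print(np.dot(Phi(x), Phi(y)))  # 1.0, as Equation (4) requires
```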
$K$ is often viewed as a similarity measure. A crucial advantage of $K$ is efficiency: there is no need anymore to define and explicitly compute $\Phi(x)$, $\Phi(y)$, and $\Phi(x) \cdot \Phi(y)$. Another benefit of $K$ is flexibility: $K$ can be arbitrarily chosen so long as the existence of $\Phi$ is guaranteed, which is called Mercer's condition. This condition is important to guarantee the convergence of training for algorithms such as SVMs.¹
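Putting Equations (3) and (4) together gives the practical recipe: both training and classification interact with the data only through kernel evaluations. A minimal sketch, where the coefficients $\alpha_i$ and offset $b$ are hypothetical stand-ins for the output of SVM training (not shown), and the polynomial kernel is an arbitrary example choice:

```python
import numpy as np

# Hypothetical training points and coefficients; in practice alpha and b
# come out of SVM training, which is not shown here.
X_train = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
alpha = np.array([0.7, 0.3, -0.6, -0.4])
b = 0.1

def K(x, y):
    # A polynomial kernel, as an example choice.
    return (np.dot(x, y) + 1) ** 2

def h(x):
    # Equation (3) with each dot product x_i . x replaced by K(x_i, x).
    return sum(a * K(x_i, x) for a, x_i in zip(alpha, X_train)) + b

print(np.sign(h(np.array([0.5, 1.0]))))  # predicted class in {-1, +1}
```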
A condition equivalent to Mercer's condition is that the kernel $K$ be positive definite and symmetric, that is, in the discrete case, the matrix $(K(x_i, x_j))_{1 \leq i,j \leq n}$ must be symmetric and positive semi-definite for any choice of $n$ points $x_1, \ldots, x_n$ in $X$. Said differently, the matrix must be symmetric and its eigenvalues non-negative. Thus, for the problem that we are interested in, the question is how to define positive definite symmetric kernels for word lattices or weighted automata.
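This matrix characterization can be checked directly. A minimal sketch, using the Gaussian kernel from the footnote on a few arbitrarily chosen sample points:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K_sigma(x, y) = exp(-||x - y||^2 / sigma^2), as in the footnote.
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)

# Gram matrix K(x_i, x_j) for a few arbitrary points.
pts = [np.array(p) for p in ([0.0, 0.0], [1.0, 0.5], [-0.5, 2.0])]
G = np.array([[gaussian_kernel(x, y) for y in pts] for x in pts])

print(np.allclose(G, G.T))                      # symmetric: True
print(np.all(np.linalg.eigvalsh(G) >= -1e-12))  # non-negative eigenvalues: True
```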
3. RATIONAL KERNELS
This section introduces a family of kernels for weighted automata, rational kernels. We will start with some preliminary definitions of automata and
¹ Some standard Mercer kernels over a vector space are the polynomial kernels of degree $d \in \mathbb{N}$, $K_d(x, y) = (x \cdot y + 1)^d$, and Gaussian kernels $K_\sigma(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$, $\sigma \in \mathbb{R}_+$.
