1566 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 5, SEPTEMBER 2006
Recognizing Disfluencies in Conversational Speech
Matthew Lease, Student Member, IEEE, Mark Johnson, and Eugene Charniak
Abstract—We present a system for modeling disfluency in conversational speech: repairs, fillers, and self-interruption points (IPs). For each sentence, candidate repair analyses are generated by a stochastic tree adjoining grammar (TAG) noisy-channel model. A probabilistic syntactic language model scores the fluency of each analysis, and a maximum-entropy model selects the most likely analysis given the language model score and other features. Fillers are detected independently via a small set of deterministic rules, and IPs are detected by combining the output of the repair and filler detection modules. In the recent Rich Transcription Fall 2004 (RT-04F) blind evaluation, systems competed to detect these three forms of disfluency under two input conditions: a best-case scenario of manually transcribed words and a fully automatic case of automatic speech recognition (ASR) output. For all three tasks and on both types of input, our system was the top performer in the evaluation.
Index Terms—Disfluency modeling, natural language processing, rich transcription, speech processing.
I. INTRODUCTION

IN spontaneous speech, speakers form their utterances on the fly and subject to various cognitive constraints and pragmatic factors. As a result, such speech is often disfluent, containing partial words (or word fragments), fillers¹ (e.g., um, uh, you know, i mean, etc.), and repairs, such as that shown in Fig. 1. While the presence of disfluencies rarely poses any comprehension problem to an engaged listener, they have been shown to negatively impact both the readability of transcribed speech [2] and the accuracy of automated analysis performed on it [3]. Such findings demonstrate the importance of moving beyond bare stream-of-words automatic speech recognition (ASR) to instead generate enriched transcriptions, in which disfluencies and sentence² boundaries are recognized in addition to words. This line of research has been notably pursued in recent years
Manuscript received September 30, 2005; revised April 26, 2006. This work was supported by the National Science Foundation under Grants LIS 9720368 and IIS 0095940. A version of this paper was presented at the Rich Transcription Fall 2004 Workshop (RT-04F). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. John Makhoul. M. Lease and E. Charniak are with the Department of Computer Science, Brown Laboratory for Linguistic Information Processing (BLLIP), Brown University, Providence, RI 02912 USA (e-mail: mlease@cs.brown.edu; ec@cs.brown.edu). M. Johnson is with the Department of Cognitive and Linguistic Sciences, Brown Laboratory for Linguistic Information Processing (BLLIP), Brown University, Providence, RI 02912 USA (e-mail: mj@cs.brown.edu). Digital Object Identifier 10.1109/TASL.2006.878269
¹While the term filler has traditionally been synonymous with filled pause, we adopt recent Linguistic Data Consortium (LDC) SimpleMDE [1] conventions in using the term to describe a broad set of vocalized space-fillers which includes filled pauses.
²We use sentence and sentence-like unit (SU) interchangeably in this text. While SU is more precise [1], the notion of sentences is widely understood.
Fig. 1. Structure of a typical repair, with crossing dependencies between reparandum and alteration.
via the Defense Advanced Research Projects Agency (DARPA) EARS program.
In previous work, we presented two systems for detecting speech repairs [3], [4], showing that a noisy-channel model based on a tree adjoining grammar (TAG) outperformed an earlier word-based classifier. In this paper, we add support for filler and self-interruption point (IP) detection and further improve repair detection via maximum-entropy modeling.
The overall architecture of our system is sketched in Fig. 2. The TAG model (Section II) takes tokenized³ input sentences and proposes candidate repair analyses, where each analysis identifies zero or more hypothesized repair locations for a given sentence. For each analysis, the fluency of the remaining, nonrepair words in the sentence is scored using a syntactic language model (Section III). Given these analyses and their associated scores, a maximum-entropy classifier (Section IV) is then used to rerank the hypotheses and select the most probable one. Use of maximum-entropy modeling permits us to use a wide range of additional information to identify the best sentence-level analysis, something that would be harder to realize in the first-stage generative model.
A limitation of the system described above is that the only form of disfluency modeled is speech repairs. While the TAG also identifies fillers involved in speech repairs, most fillers actually occur outside of repair contexts and so are not identified by that model. To address this, we augmented our TAG-based model with a small set of hand-crafted, deterministic rules for filler detection (Section V). While our eventual goal is to incorporate filler detection directly into the noisy-channel model, these simple rules have proven to be surprisingly effective. Support for IP detection was added via deterministic combination of repair and filler predictions (Section VII).
Section VI describes the data used in training and evaluating our system, and Section VII presents system performance in the Rich Transcription Fall 2004 (RT-04F) blind evaluation [5]. As part of this evaluation, system performance was measured on two types of input: a best-case condition of manually transcribed words and a fully automatic case of ASR output. For all three tasks and on both types of input, our system was the top performer in the evaluation. The results on ASR output are particularly encouraging in showing that syntactic language modeling features continue to be useful even in the presence of significant word recognition errors (Table II).
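The rule-based filler detection mentioned above can be illustrated with a short sketch. The lexicon and longest-match strategy below are hypothetical stand-ins for exposition, not the paper's actual Section V rules:

```python
# Illustrative rule-based filler detector. FILLER_PHRASES is a toy
# lexicon (an assumption for this sketch, not the paper's rule set).
FILLER_PHRASES = [("you", "know"), ("i", "mean"), ("uh",), ("um",)]

def detect_fillers(tokens):
    """Return (start, end) token spans of detected fillers, end exclusive.
    Longer phrases are tried before shorter ones at each position."""
    spans, i = [], 0
    while i < len(tokens):
        for phrase in sorted(FILLER_PHRASES, key=len, reverse=True):
            if tuple(t.lower() for t in tokens[i:i + len(phrase)]) == phrase:
                spans.append((i, i + len(phrase)))
                i += len(phrase)
                break
        else:
            i += 1
    return spans

print(detect_fillers("flight to Boston uh I mean to Denver".split()))
# [(3, 4), (4, 6)]  -- "uh" and "I mean"
```

A real detector would also need to avoid flagging literal uses of these phrases (e.g., "I know what you mean"), which is part of why such rules are applied deterministically and tuned by hand.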
³http://www.cis.upenn.edu/~treebank/tokenizer.sed
1558-7916/$20.00 © 2006 IEEE
 
Fig. 2. Overall system architecture. Roman script identifies processing steps, while italic script identifies data. Input IP probabilities serve as additional features for reranking (Section IV).
II. TAG CHANNEL MODEL
The TAG model is responsible for proposing candidate repair analyses for each sentence. We follow previous work on speech repairs [6], [7] in dividing a repair into three parts: the reparandum (the material repaired), the interregnum (or editing term) that typically consists of zero or more fillers, and the alteration. Fig. 1 shows these three parts for a typical repair. As this figure shows, the alteration can often be understood as a rough copy of the reparandum, using the same or very similar words in roughly the same order [7]. In other words, a speech repair seems to involve cross-serial dependencies between the reparandum and the alteration, as shown in Fig. 1. Languages with an unbounded number of cross-serial dependencies cannot be described by context-free or finite-state grammars, and crossed dependencies like these have been used to argue that natural languages are not context-free languages [8]. Fortunately, mildly context-sensitive grammar formalisms such as TAGs and combinatory categorial grammars can model cross-serial dependencies, and this is why we adopt a TAG-based model here.
Fig. 3 shows our TAG model's dependency structure for the repair of Fig. 1. Interestingly, if we trace the temporal word string through this dependency structure, aligning words next to the words they are dependent on, we obtain a helical type of structure familiar from genomics, and in fact TAGs are being used to model genomes for very similar reasons [9].
To effectively model both the crossed dependencies of repairs and the more usual linear or tree-structured dependencies
Fig. 3. The helical dependency structure induced by the generative model of speech repairs for the repair depicted in Fig. 1.
of nonrepaired speech, we adopt the noisy channel paradigm [7], [10]. We begin by imagining that speakers intend to say source sentences X (with no repairs), but may mistakenly insert one or more repairs, producing observed sentences Y. Our goal is, for each observed sentence Y, to recover the most likely source sentence. Applying Bayes' rule, we can formulate this problem in canonical noisy channel form:

X̂ = argmax_X P(X | Y) = argmax_X P(Y | X) P(X)

The channel model P(Y | X) defines a stochastic mapping of source sentences X into observed sentences Y via the optional insertion of one or more repairs, and the language model P(X) defines a probability distribution over source sentences. This is the same general setup that has also been effectively applied to speech recognition [11] and machine translation [12]. In our application, the channel model is realized as a TAG, and we train our syntactic language model (Section III) on sentences with the speech repairs removed.
The channel model nondeterministically predicts repairs at every position in the input string, conditioned on the preceding word. Because of this conditioning, it captures the well-known effect that repairs are more likely to occur at the beginning of a sentence than elsewhere. The interregnum is generated by a simple unigram model over words or pairs of words such as I mean, etc. Each reparandum word is generated conditioned on the word preceding it and the word in the alteration string corresponding to it. The TAG transducer is effectively a simple first-order Markov model (albeit one that captures string-nonlocal dependencies).
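The noisy-channel decision rule can be sketched as follows. The candidate analyses and their probabilities below are toy values chosen for illustration, not outputs of the trained TAG channel model or syntactic language model:

```python
import math

def best_source(candidates):
    """Pick the candidate source sentence X maximizing P(Y|X) P(X),
    i.e. the sum of channel and language-model log probabilities.
    P(Y) is constant across candidates and can be ignored."""
    return max(candidates, key=lambda c: c["channel_logp"] + c["lm_logp"])

# Toy candidates for one observed sentence: keep it verbatim (channel
# prefers inserting nothing) or remove a hypothesized repair (language
# model strongly prefers the fluent result).
candidates = [
    {"source": "flight to Boston uh I mean to Denver",
     "channel_logp": math.log(0.9), "lm_logp": math.log(0.001)},
    {"source": "flight to Denver",
     "channel_logp": math.log(0.2), "lm_logp": math.log(0.05)},
]
print(best_source(candidates)["source"])  # flight to Denver
```

The same trade-off drives the real system: the channel model alone favors positing few repairs, and it is the language model's preference for fluent source sentences that makes removing the reparandum worthwhile.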
A. Formal Description
The TAG channel model defines a stochastic mapping of source sentences X into observed sentences Y. There are several ways to define transducers using TAGs, such as [13], but the following simple method inspired by finite-state transducers suffices for the application here. The TAG defines a language whose vocabulary is a set of word pairs y:x, where y is drawn from the vocabulary of the observed sentences and either component may be empty. A string in this language can be interpreted as a pair of strings (Y, X), where Y is the concatenation of the projection of the first components and X is the concatenation of the projection of the second components. For example, the string flight:flight to: Boston: uh: I: mean: to:to Denver:Denver on:on Friday:Friday has the observed string flight to Boston uh I mean to Denver on Friday and the source string flight to Denver on Friday. Fig. 4 shows the TAG rules used to generate this example.
 
Fig. 4. TAG rules used to generate the example shown in Fig. 1, their respective weights, and the corresponding derivation and derived trees. The nonterminals in this grammar are of the form N_w, R_{w,w'}, and I, where w is a word appearing in the source string and w' is a word appearing in the observed string. Informally, the N_w nonterminals indicate that the preceding word w was analyzed as not being part of a repair, while the R_{w,w'} nonterminals indicate that the preceding words w and w' were part of a repair, with w in the alteration corresponding to w' in the reparandum. The nonterminal I generates words in the interregnum of a repair. The symbol * indicates the foot node in an auxiliary (repair) tree. Encoding the preceding words in the TAG's nonterminals permits the channel model to be sensitive to lexical properties of the preceding words. The start symbol is N_$, where $ is a distinguished symbol used to indicate the beginning and end of sentences. Details of the associated conditional probability distributions above can be found in [4].
B. Model Estimation
To estimate weights on the TAG productions described in Section II-A, we need to know the TAG derivation of each sentence in the training data. Uniquely determining this requires knowing both the locations of the reparandum, interregnum, and alteration regions of each repair (which are annotated; see Section VI) and the crossing dependencies between the reparandum and alteration words (as indicated in Fig. 1), which are not annotated. We obtain the latter by aligning the reparandum and repair strings of each repair using a minimum-edit-distance string aligner like [14] with the following alignment costs:
0  identical words;
2  words with the same POS tag;
4  an insertion or a deletion;
5  words with POS tags that begin with the same letter;
7  an arbitrary substitution.
