Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 193–200, Vancouver, October 2005. © 2005 Association for Computational Linguistics
Predicting Sentences using N-Gram Language Models
Steffen Bickel, Peter Haider, and Tobias Scheffer
Humboldt-Universität zu Berlin, Department of Computer Science, Unter den Linden 6, 10099 Berlin, Germany
{bickel, haider, scheffer}@informatik.hu-berlin.de
Abstract
We explore the benefit that users in several application areas can experience from a “tab-complete” editing assistance function. We develop an evaluation metric and adapt N-gram language models to the problem of predicting the subsequent words, given an initial text fragment. Using an instance-based method as baseline, we empirically study the predictability of call-center emails, personal emails, weather reports, and cooking recipes.
1 Introduction
Prediction of user behavior is a basis for the construction of assistance systems; it has therefore been investigated in diverse application areas. Previous studies have shed light on the predictability of the next Unix command that a user will enter (Motoda and Yoshida, 1997; Davison and Hirsch, 1998), the next keystrokes on a small input device such as a PDA (Darragh and Witten, 1992), and of the translation that a human translator will choose for a given foreign sentence (Nepveu et al., 2004).

We address the problem of predicting the subsequent words, given an initial fragment of text. This problem is motivated by the perspective of assistance systems for repetitive tasks such as answering emails in call centers or letters in an administrative environment. Both instance-based learning and N-gram models can conjecture completions of sentences. The use of N-gram models requires the application of the Viterbi principle to this particular decoding problem.

Quantifying the benefit of editing assistance to a user is challenging because it depends not only on an observed distribution over documents, but also on the reading and writing speed, personal preference, and training status of the user. We develop an evaluation metric and protocol that is practical, intuitive, and independent of the user-specific trade-off between keystroke savings and time lost due to distractions. We experiment on corpora of service-center emails, personal emails of an Enron executive, weather reports, and cooking recipes.

The rest of this paper is organized as follows. We review related work in Section 2. In Section 3, we discuss the problem setting and derive appropriate performance metrics. We develop the N-gram-based completion method in Section 4. In Section 5, we discuss empirical results. Section 6 concludes.
2 Related Work
Shannon (1951) analyzed the predictability of sequences of letters. He found that written English has a high degree of redundancy. Based on this finding, it is natural to ask whether users can be supported in the process of writing text by systems that predict the intended next keystrokes, words, or sentences. Darragh and Witten (1992) have developed an interactive keyboard that uses the sequence of past keystrokes to predict the most likely succeeding keystrokes. Clearly, in an unconstrained application context, keystrokes can only be predicted with limited accuracy. In the specific context of entering URLs, completion predictions are commonly provided by web browsers (Debevc et al., 1997).

Motoda and Yoshida (1997) and Davison and Hirsch (1998) developed a Unix shell which predicts the command stubs that a user is most likely to enter, given the current history of entered commands. Korvemaker and Greiner (2000) have developed this idea into a system which predicts entire command lines. The Unix command prediction problem has also been addressed by Jacobs and Blockeel (2001) who infer macros from frequent command sequences and predict the next command using variable memory Markov models (Jacobs and Blockeel, 2003).

In the context of natural language, several typing assistance tools for apraxic (Garay-Vitoria and Abascal, 2004; Zagler and Beck, 2002) and dyslexic (Magnuson and Hunnicutt, 2002) persons have been developed. These tools provide the user with a list of possible word completions to select from. For these users, scanning and selecting from lists of proposed words is usually more efficient than typing. By contrast, scanning and selecting from many displayed options can slow down skilled writers (Langlais et al., 2002; Magnuson and Hunnicutt, 2002).

Assistance tools have furthermore been developed for translators. Computer aided translation systems combine a translation and a language model in order to provide a (human) translator with a list of suggestions (Langlais et al., 2000; Langlais et al., 2004; Nepveu et al., 2004). Foster et al. (2002) introduce a model that adapts to a user’s typing speed in order to achieve a better trade-off between distractions and keystroke savings. Grabski and Scheffer (2004) have previously developed an indexing method that efficiently retrieves the sentence from a collection that is most similar to a given initial fragment.
3 Problem Setting and Evaluation
Given an initial text fragment, a predictor that solves the sentence completion problem has to conjecture as much of the sentence that the user currently intends to write as is possible with high confidence; preferably, but not necessarily, the entire remainder.

The perceived benefit of an assistance system is highly subjective, because it depends on the expenditure of time for scanning and deciding on suggestions, and on the time saved due to helpful assistance. The user-specific benefit is influenced by quantitative factors that we can measure. We construct a system of two conflicting performance indicators: our definition of precision quantifies the inverse risk of unnecessary distractions, our definition of recall quantifies the rate of keystroke savings.

For a given sentence fragment, a completion method may, but need not, cast a completion conjecture. Whether the method suggests a completion, and how many words are suggested, will typically be controlled by a confidence threshold. We consider the entire conjecture to be falsely positive if at least one word is wrong. This harsh view reflects previous results which indicate that selecting, and then editing, a suggested sentence often takes longer than writing that sentence from scratch (Langlais et al., 2000). In a conjecture that is entirely accepted by the user, the entire string is a true positive. A conjecture may contain only a part of the remaining sentence and therefore the recall, which refers to the length of the missing part of the current sentence, may be smaller than 1.

For a given test collection, precision and recall are defined in Equations 1 and 2. Recall equals the fraction of saved keystrokes (disregarding the interface-dependent single keystroke that is most likely required to accept a suggestion); precision is the fraction of suggested characters that users accept, so its inverse is the number of characters that users have to scan for each character they accept. Varying the confidence threshold of a sentence completion method results in a precision recall curve that characterizes the system-specific trade-off between keystroke savings and unnecessary distractions.
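$$\text{Precision} = \frac{\sum_{\text{accepted completions}} \text{string length}}{\sum_{\text{suggested completions}} \text{string length}} \tag{1}$$

$$\text{Recall} = \frac{\sum_{\text{accepted completions}} \text{string length}}{\sum_{\text{all queries}} \text{length of missing part}} \tag{2}$$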
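To make Equations 1 and 2 concrete, the following Python sketch computes both quantities for a small test collection. It is an illustration only, not the evaluation code used in the experiments: the names `is_accepted` and `precision_recall`, the representation of a query as a (suggestion, remainder) pair, and the whitespace tokenization are assumptions. A suggestion counts as accepted only if every suggested word matches the beginning of the true remainder, which mirrors the view that a conjecture with at least one wrong word is a false positive.

```python
from typing import List, Optional, Tuple

# (suggested completion or None, true remainder of the sentence); hypothetical format
Query = Tuple[Optional[str], str]


def is_accepted(suggestion: str, remainder: str) -> bool:
    """True if every suggested word matches the start of the true remainder."""
    s_words, r_words = suggestion.split(), remainder.split()
    return 0 < len(s_words) <= len(r_words) and r_words[:len(s_words)] == s_words


def precision_recall(queries: List[Query]) -> Tuple[float, float]:
    """Character-level precision and recall in the spirit of Equations 1 and 2."""
    accepted = suggested = missing = 0
    for suggestion, remainder in queries:
        missing += len(remainder)          # denominator of Equation 2
        if suggestion is None:
            continue                       # no conjecture was cast for this query
        suggested += len(suggestion)       # denominator of Equation 1
        if is_accepted(suggestion, remainder):
            accepted += len(suggestion)    # numerator shared by both equations
    precision = accepted / suggested if suggested else 0.0
    recall = accepted / missing if missing else 0.0
    return precision, recall


queries = [
    ("you for your message", "you for your message ."),  # fully correct
    ("regards", "wishes from Berlin"),                    # wrong, counts as distraction
    (None, "call us tomorrow"),                           # method stayed silent
]
precision, recall = precision_recall(queries)
```

In this toy collection, raising the confidence threshold so that the wrong suggestion is suppressed would increase precision without changing recall, which is the trade-off that the precision recall curve traces.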
4 Algorithms for Sentence Completion
In this section, we derive our solution to the sentence completion problem based on linear interpolation of N-gram models. We derive a k-best Viterbi decoding algorithm with a confidence-based stopping criterion which conjectures the words that most likely succeed an initial fragment. Additionally, we briefly discuss an instance-based method that provides an alternative approach and baseline for our experiments.

In order to solve the sentence completion problem with an N-gram model, we need to find the most likely word sequence $w_{t+1},\dots,w_{t+T}$ given a word N-gram model and an initial sequence $w_1,\dots,w_t$ (Equation 3). Equation 4 factorizes the joint probability of the missing words; the N-th order Markov assumption that underlies the N-gram model simplifies this expression in Equation 5.
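\begin{align}
& \operatorname*{argmax}_{w_{t+1},\dots,w_{t+T}} P(w_{t+1},\dots,w_{t+T} \mid w_1,\dots,w_t) \tag{3} \\
&= \operatorname*{argmax}_{w_{t+1},\dots,w_{t+T}} \prod_{j=1}^{T} P(w_{t+j} \mid w_1,\dots,w_{t+j-1}) \tag{4} \\
&= \operatorname*{argmax}_{w_{t+1},\dots,w_{t+T}} \prod_{j=1}^{T} P(w_{t+j} \mid w_{t+j-N+1},\dots,w_{t+j-1}) \tag{5}
\end{align}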
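The factorization in Equations 4 and 5 can be illustrated with a short Python sketch that scores one candidate completion by summing the logarithms of its N-gram factors. The unsmoothed maximum-likelihood estimates, the `<s>` padding token, and the names `train_counts` and `completion_log_prob` are assumptions made for this illustration; a single unsmoothed model of this kind assigns zero probability to unseen N-grams and is not the model used in the experiments.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def train_counts(sentences: Iterable[Sequence[str]], n: int) -> Tuple[Counter, Counter]:
    """Count n-grams and their (n-1)-word histories over tokenized sentences."""
    ngrams, histories = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] * (n - 1) + list(sent)
        for i in range(n - 1, len(padded)):
            hist = tuple(padded[i - n + 1:i])
            ngrams[hist + (padded[i],)] += 1
            histories[hist] += 1
    return ngrams, histories


def completion_log_prob(context: Sequence[str], completion: Sequence[str],
                        ngrams: Counter, histories: Counter, n: int) -> float:
    """log of prod_j P(w_{t+j} | previous n-1 words), as in Equation 5."""
    words = ["<s>"] * (n - 1) + list(context)
    total = 0.0
    for w in completion:
        hist = tuple(words[len(words) - (n - 1):])  # only the last n-1 words matter
        count = ngrams[hist + (w,)]
        if count == 0:
            return float("-inf")                    # unseen n-gram, no smoothing here
        total += math.log(count / histories[hist])
        words.append(w)
    return total


corpus = [["thank", "you", "."], ["thank", "you", "very", "much", "."]]
ngrams, histories = train_counts(corpus, n=2)
score = completion_log_prob(["thank"], ["you", "."], ngrams, histories, n=2)
```

Working in log space avoids numerical underflow when many factors are multiplied, and it turns the product of Equation 5 into a sum.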
The individual factors of Equation 5 are provided by the model. The Markov order $N$ has to balance sufficient context information and sparsity of the training data. A standard solution is to use a weighted linear mixture of n-gram models, $1 \le n \le N$ (Brown et al., 1992). We use an EM algorithm to select mixing weights that maximize the generation probability of a tuning set of sentences that have not been used for training.

We are left with the following questions: (a) how can we decode the most likely completion efficiently; and (b) how many words should we predict?
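The selection of mixing weights by EM can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the component probabilities of each held-out token are assumed to be precomputed and passed in as rows, and the names `interpolate` and `em_mixture_weights` as well as the fixed iteration count are hypothetical.

```python
from typing import List, Sequence


def interpolate(weights: Sequence[float], probs: Sequence[float]) -> float:
    """Linear mixture: sum_n lambda_n * P_n(w | history)."""
    return sum(lam * p for lam, p in zip(weights, probs))


def em_mixture_weights(component_probs: Sequence[Sequence[float]],
                       iterations: int = 50) -> List[float]:
    """Estimate mixing weights on held-out data.

    component_probs[i][k] is the probability that the k-th n-gram model
    assigns to the i-th token of the tuning set, given its history.
    """
    k = len(component_probs[0])
    weights = [1.0 / k] * k                      # start from a uniform mixture
    for _ in range(iterations):
        expected = [0.0] * k
        for probs in component_probs:
            mix = interpolate(weights, probs)
            if mix == 0.0:
                continue                         # token impossible under all components
            for j in range(k):
                # E-step: responsibility of component j for this token
                expected[j] += weights[j] * probs[j] / mix
        total = sum(expected)
        if total == 0.0:
            break                                # nothing to re-estimate from
        weights = [e / total for e in expected]  # M-step: normalized responsibilities
    return weights


# Two components (say, unigram and bigram); the second fits the tuning data better.
tuning = [[0.1, 0.9], [0.2, 0.8], [0.1, 0.7]]
weights = em_mixture_weights(tuning)
```

Each iteration cannot decrease the likelihood of the tuning set, so the weights move toward the components that explain the held-out sentences best. Estimating them on data not used for training matters because the highest-order model would otherwise always look best.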
4.1 Efficient Prediction
We have to address the problem of finding the most likely completion, $\operatorname{argmax}_{w_{t+1},\dots,w_{t+T}} P(w_{t+1},\dots,w_{t+T} \mid w_1,\dots,w_t)$, efficiently, even though the size of the search space grows exponentially in the number of predicted words.

We will now identify the recursive structure in Equation 3; this will lead us to a Viterbi algorithm that retrieves the most likely word sequence. We first define an auxiliary variable $\delta_{t,s}(w'_1,\dots,w'_N \mid w_{t-N+2},\dots,w_t)$ in Equation 6; it quantifies the greatest possible probability over all arbitrary word sequences $w_{t+1},\dots,w_{t+s}$, followed by the word sequence $w_{t+s+1} = w'_1,\dots,w_{t+s+N} = w'_N$, conditioned on the initial word sequence $w_{t-N+2},\dots,w_t$.

In Equation 7, we factorize the last transition and utilize the N-th order Markov assumption. In Equation 8, we split the maximization and introduce a new random variable $w'_0$ for $w_{t+s}$. We can now refer to the definition of $\delta$ and see the recursion in Equation 9: $\delta_{t,s}$ depends only on $\delta_{t,s-1}$ and the N-gram model probability $P(w'_N \mid w'_1,\dots,w'_{N-1})$.
\begin{align}
& \delta_{t,s}(w'_1,\dots,w'_N \mid w_{t-N+2},\dots,w_t) \notag \\
&= \max_{w_{t+1},\dots,w_{t+s}} P(w_{t+1},\dots,w_{t+s},\, w_{t+s+1} = w'_1,\dots,w_{t+s+N} = w'_N \mid w_{t-N+2},\dots,w_t) \tag{6} \\
&= \max_{w_{t+1},\dots,w_{t+s}} P(w'_N \mid w'_1,\dots,w'_{N-1})\, P(w_{t+1},\dots,w_{t+s},\, w_{t+s+1} = w'_1,\dots,w_{t+s+N-1} = w'_{N-1} \mid w_{t-N+2},\dots,w_t) \tag{7} \\
&= \max_{w'_0}\, \max_{w_{t+1},\dots,w_{t+s-1}} P(w'_N \mid w'_1,\dots,w'_{N-1})\, P(w_{t+1},\dots,w_{t+s-1},\, w_{t+s} = w'_0,\dots,w_{t+s+N-1} = w'_{N-1} \mid w_{t-N+2},\dots,w_t) \tag{8} \\
&= \max_{w'_0} P(w'_N \mid w'_1,\dots,w'_{N-1})\, \delta_{t,s-1}(w'_0,\dots,w'_{N-1} \mid w_{t-N+2},\dots,w_t) \tag{9}
\end{align}
Exploiting the N-th order Markov assumption, we can now express our target probability (Equation 3) in terms of $\delta$ in Equation 10.

\begin{align}
& \max_{w_{t+1},\dots,w_{t+T}} P(w_{t+1},\dots,w_{t+T} \mid w_{t-N+2},\dots,w_t) \tag{10} \\
&= \max_{w'_1,\dots,w'_N} \delta_{t,T-N}(w'_1,\dots,w'_N \mid w_{t-N+2},\dots,w_t) \notag
\end{align}

The last $N$ words in the most likely sequence are simply the $\operatorname{argmax}_{w'_1,\dots,w'_N} \delta_{t,T-N}(w'_1,\dots,w'_N \mid w_{t-N+2},\dots,w_t)$. In order to collect the preceding most likely words, we define an auxiliary variable $\Psi$ in Equation 11 that can be determined in Equation 12. We have now found a Viterbi algorithm that is linear in $T$, the completion length.
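\begin{align}
& \Psi_{t,s}(w'_1,\dots,w'_N \mid w_{t-N+2},\dots,w_t) \notag \\
&= \operatorname*{argmax}_{w_{t+s}}\, \max_{w_{t+1},\dots,w_{t+s-1}} P(w_{t+1},\dots,w_{t+s},\, w_{t+s+1} = w'_1,\dots,w_{t+s+N} = w'_N \mid w_{t-N+2},\dots,w_t) \tag{11} \\
&= \operatorname*{argmax}_{w'_0}\, \delta_{t,s-1}(w'_0,\dots,w'_{N-1} \mid w_{t-N+2},\dots,w_t)\, P(w'_N \mid w'_1,\dots,w'_{N-1}) \tag{12}
\end{align}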
The Viterbi algorithm starts with the most recently entered word $w_t$ and moves iteratively into the future. When the N-th token in the highest scored $\delta$ is a period, then we can stop as our goal is only to predict (parts of) the current sentence. However, since