A Classification Approach to Word Prediction*
Yair Even-Zohar, Dan Roth
Department of Computer Science
University of Illinois at Urbana-Champaign
{evenzoha, danr}@uiuc.edu
Abstract
The eventual goal of a language model is to accurately predict the value of a missing word given its context. We present an approach to word prediction that is based on learning a representation for each word as a function of words and linguistic predicates in its context. This approach raises a few new questions that we address. First, in order to learn good word representations it is necessary to use an expressive representation of the context. We present a way that uses external knowledge to generate expressive context representations, along with a learning method capable of handling the large number of features generated this way that can, potentially, contribute to each prediction. Second, since the number of words "competing" for each prediction is large, there is a need to "focus the attention" on a smaller subset of these. We exhibit the contribution of a "focus of attention" mechanism to the performance of the word predictor. Finally, we describe a large scale experimental study in which the approach presented is shown to yield significant improvements in word prediction tasks.
1 Introduction
The task of predicting the most likely word based on properties of its surrounding context is the archetypical prediction problem in natural language processing (NLP). In many NLP tasks it is necessary to determine the most likely word, part-of-speech (POS) tag or any other token, given its history or context. Examples include part-of-speech tagging, word-sense disambiguation, speech recognition, accent restoration, word choice selection in machine translation, context-sensitive spelling correction and identifying discourse markers. Most approaches to these problems are based on n-gram-like modeling. Namely, the learning methods make use of features which are conjunctions of typically (up to) three consecutive words or POS tags in order to derive the predictor.

In this paper we show that incorporating additional information into the learning process is very beneficial. In particular, we provide the learner with a rich set of features that combine the information available in the local context along with shallow parsing information. At the same time, we study a learning approach that is specifically tailored for problems in which the potential number of features is very large but only a fairly small number of them actually participates in the decision. Word prediction experiments that we perform show significant improvements in error rate relative to the use of the traditional, restricted, set of features.

* This research is supported by NSF grants IIS-9801638 and SBR-987345.

Background
The most influential problem in motivating the application of statistical learning to NLP tasks is that of word selection in speech recognition (Jelinek, 1998). There, word classifiers are derived from a probabilistic language model which estimates the probability of a sentence s using Bayes rule as the product of conditional probabilities,
$$\Pr(s) = \prod_{i=1}^{n} \Pr(w_i \mid w_1, \ldots, w_{i-1}) = \prod_{i=1}^{n} \Pr(w_i \mid h_i)$$
where $h_i$ is the relevant history when predicting $w_i$. Thus, in order to predict the most likely word in a given context, a global estimation of the sentence probability is derived which, in turn, is computed by estimating the probability of each word given its local context or history. Estimating terms of the form $\Pr(w \mid h)$ is done by assuming some generative probabilistic model, typically using Markov or other independence assumptions, which gives rise to estimating conditional probabilities of n-gram type features (in the word or POS space). Machine learning based classifiers and maximum entropy models which, in principle, are not restricted to features of these forms have used them nevertheless, perhaps under the influence of probabilistic methods (Brill, 1995; Yarowsky, 1994; Ratnaparkhi et al., 1994).

It has been argued that the information available in the local context of each word should be augmented by global sentence information and even information external to the sentence in order to learn
better classifiers and language models. Efforts in this direction consist of (1) directly adding syntactic information, as in (Chelba and Jelinek, 1998; Rosenfeld, 1996), and (2) indirectly adding syntactic and semantic information, via similarity models; in this case n-gram type features are used whenever possible, and when they cannot be used (due to data sparsity), additional information compiled into a similarity measure is used (Dagan et al., 1999). Nevertheless, the efforts in this direction so far have shown only insignificant improvements, if any (Chelba and Jelinek, 1998; Rosenfeld, 1996). We believe that the main reason for that is that incorporating information sources in NLP needs to be coupled with a learning approach that is suitable for it.

Studies have shown that both machine learning and probabilistic learning methods used in NLP make decisions using a linear decision surface over the feature space (Roth, 1998; Roth, 1999). In this view, the feature space consists of simple functions (e.g., n-grams) over the original data so as to allow for expressive enough representations using a simple functional form (e.g., a linear function). This implies that the number of potential features that the learning stage needs to consider may be very large, and may grow rapidly when increasing the expressivity of the features. Therefore a feasible computational approach needs to be feature-efficient. It needs to tolerate a large number of potential features in the sense that the number of examples required for it to converge should depend mostly on the number of features relevant to the decision, rather than on the number of potential features.

This paper addresses the two issues mentioned above. It presents a rich set of features that is constructed using information readily available in the sentence along with shallow parsing and dependency information.
It then presents a learning approach that can use this expressive (and potentially large) intermediate representation and shows that it yields a significant improvement in word error rate for the task of word prediction.

The rest of the paper is organized as follows. In section 2 we formalize the problem, discuss the information sources available to the learning system and how we use those to construct features. In section 3 we present the learning approach, based on the SNoW learning architecture. Section 4 presents our experimental study and results. In section 4.4 we discuss the issue of deciding on a set of candidate words for each decision. Section 5 concludes and discusses future work.

2 Information Sources and Features
Our goal is to learn a representation for each word in terms of features which characterize the syntactic and semantic context in which the word tends to appear. Our features are defined as simple relations over a collection of predicates that capture (some of) the information available in a sentence.

2.1 Information Sources
Definition 1 Let $s = \langle w_1, w_2, \ldots, w_n \rangle$ be a sentence in which $w_i$ is the i-th word. Let $\mathcal{I}$ be a collection of predicates over a sentence s. IS(s)¹, the Information Source(s) available for the sentence s, is a representation of s as a list of predicates $I \in \mathcal{I}$,

$$IS(s) = \{ I_1(w_{1_1}, \ldots, w_{1_{j_1}}), \ldots, I_k(w_{k_1}, \ldots, w_{k_{j_k}}) \}.$$

$j_i$ is the arity of the predicate $I_i$.
Example 2 Let s be the sentence

< John, X, at, the, clock, to, see, what, time, it, is >

Let $\mathcal{I}$ = {word, pos, subj-verb}, with the interpretation that word is a unary predicate that returns the value of the word in its domain; pos is a unary predicate that returns the value of the pos of the word in its domain, in the context of the sentence; subj-verb is a binary predicate that returns the value of the two words in its domain if the second is a verb in the sentence and the first is its subject; it returns $\phi$ otherwise. Then,

$$IS(s) = \{ word(w_1) = John, \ldots, word(w_3) = at, \ldots, word(w_n) = is, \; pos(w_4) = DET, \ldots, \; subj\text{-}verb(w_1, w_2) = \{John, X\}, \ldots \}.$$

The IS representation of s consists only of the predicates with non-empty values. E.g., $pos(w_6) = modal$ is not part of the IS for the sentence above. subj-verb might not exist at all in the IS even if the predicate is available, e.g., in "The ball was given to Mary".

Clearly the IS representation of s does not contain all the information available to a human reading s; it captures, however, all the input that is available to the computational process discussed in the rest of this paper. The predicates could be generated by any external mechanism, even a learned one. This issue is orthogonal to the current discussion.
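As an illustration of Definition 1 and Example 2, the following is a minimal sketch of how an IS could be stored, assuming a dictionary keyed by predicate name and 1-indexed word positions. The function name `information_source` and the shape of the annotation arguments are hypothetical and are not taken from the paper; only the predicate values stated in Example 2 are filled in.

```python
# Sentence from Example 2; "X" marks the word to be predicted.
SENTENCE = ["John", "X", "at", "the", "clock", "to", "see", "what", "time", "it", "is"]


def information_source(words, pos, subj_verb):
    """Build IS(s) as a dict mapping (predicate, positions) -> value.

    Positions are 1-indexed, as in the paper. `pos` maps a position to its
    POS tag and `subj_verb` lists (subject, verb) position pairs; both are
    assumed to come from some external annotator. Predicates with empty
    values are simply absent from the result.
    """
    info = {}
    for i, word in enumerate(words, start=1):
        info[("word", (i,))] = word
    for i, tag in pos.items():
        info[("pos", (i,))] = tag
    for subj, verb in subj_verb:
        info[("subj-verb", (subj, verb))] = (words[subj - 1], words[verb - 1])
    return info


# Only the annotations that Example 2 states explicitly are supplied here.
IS = information_source(SENTENCE, pos={4: "DET"}, subj_verb=[(1, 2)])
```

Keeping only non-empty predicates mirrors the remark that pos(w_6) = modal is not part of the IS: a lookup for a missing key fails instead of returning an empty value.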
2.2 Generating Features
Our goal is to learn a representation for each word of interest. Most efficient learning methods known today and, in particular, those used in NLP, make use of a linear decision surface over their feature space (Roth, 1998; Roth, 1999). Therefore, in order to learn expressive representations one needs to compose complex features as a function of the information sources available. A linear function expressed directly in terms of those will not be expressive enough. We now define a language that allows one to define "types" of features² in terms of the information sources available to it.

¹ We denote IS(s) as IS wherever it is obvious which sentence is referred to, or whenever we want to indicate Information Source in general.

Definition 3 (Basic Features)
Let $I \in \mathcal{I}$ be a k-ary predicate with range R. Denote $w^k = (w_{j_1}, \ldots, w_{j_k})$. We define two basic binary relations as follows. For $a \in R$ we define:

$$f(I(w^k), a) = \begin{cases} 1 & \text{iff } I(w^k) = a \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

An existential version of the relation is defined by:

$$f(I(w^k), x) = \begin{cases} 1 & \text{iff } \exists a \in R \text{ s.t. } I(w^k) = a \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Features, which are defined as binary relations, can be composed to yield more complex relations in terms of the original predicates available in IS.

Definition 4 (Composing features)
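Let $f_1, f_2$ be feature definitions. Then $f_{and}(f_1, f_2)$, $f_{or}(f_1, f_2)$ and $f_{not}(f_1)$ are defined and given the usual semantics:

$$f_{and}(f_1, f_2) = \begin{cases} 1 & \text{iff } f_1 = f_2 = 1 \\ 0 & \text{otherwise} \end{cases}$$

$$f_{or}(f_1, f_2) = \begin{cases} 1 & \text{iff } f_1 = 1 \text{ or } f_2 = 1 \\ 0 & \text{otherwise} \end{cases}$$

$$f_{not}(f_1) = \begin{cases} 1 & \text{iff } f_1 = 0 \\ 0 & \text{otherwise} \end{cases}$$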
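The basic relations (1) and (2) and the composition operators of Definition 4 can be sketched in a few lines. This is an illustrative reading, not the paper's implementation: the IS is assumed to be a dictionary keyed by (predicate, positions), and the function names `basic` and `exists` are hypothetical.

```python
# A small hand-written IS fragment for the sentence of Example 2
# (an assumed storage format: (predicate, positions) -> value).
IS = {
    ("word", (1,)): "John",
    ("word", (3,)): "at",
    ("pos", (4,)): "DET",
    ("subj-verb", (1, 2)): ("John", "X"),
}


def basic(info, pred, positions, value):
    """Relation (1): 1 iff the predicate applied to the words equals `value`."""
    return 1 if (pred, positions) in info and info[(pred, positions)] == value else 0


def exists(info, pred, positions):
    """Relation (2): 1 iff the predicate takes some value on the words."""
    return 1 if (pred, positions) in info else 0


def f_and(f1, f2):
    """1 iff both features are active."""
    return 1 if f1 == 1 and f2 == 1 else 0


def f_or(f1, f2):
    """1 iff at least one feature is active."""
    return 1 if f1 == 1 or f2 == 1 else 0


def f_not(f1):
    """1 iff the feature is inactive."""
    return 1 if f1 == 0 else 0


# "w3 is 'at' AND w4 is a determiner", composed from two basic features.
at_then_det = f_and(basic(IS, "word", (3,), "at"), basic(IS, "pos", (4,), "DET"))
```

Because every feature evaluates to 0 or 1, composed features can be nested freely, e.g. `f_not(f_or(a, b))`.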
In order to learn with features generated using these definitions as input, it is important that features generated when applying the definitions on different ISs are given the same identification. In this presentation we assume that the composition operator along with the appropriate IS element (e.g., Ex. 2, Ex. 9) are written explicitly as the identification of the features. Some of the subtleties in defining the output representation are addressed in (Cumby and Roth, 2000).

2.3 Structured Features
So far we have presented features as relations over IS(s) and allowed for Boolean composition operators. In most cases more information than just a list of active predicates is available. We abstract this using the notion of a structural information source (SIS(s)) defined below. This allows a richer class of feature types to be defined.

² We note that we do not define the features that will be used in the learning process. These are going to be defined in a data driven way given the definitions discussed here and the input ISs. The importance of formally defining the "types" is due to the fact that some of these are quantified. Evaluating them on a given sentence might be computationally intractable and a formal definition would help to flesh out the difficulties and aid in designing the language (Cumby and Roth, 2000).
2.4 Structured Instances
Definition 5 (Structural Information Source) Let $s = \langle w_1, w_2, \ldots, w_n \rangle$. SIS(s), the Structural Information Source(s) available for the sentence s, is a tuple $(s, E_1, \ldots, E_k)$ of directed acyclic graphs with s as the set of vertices and the $E_i$'s as sets of edges in s.
Example 6 (Linear Structure) The simplest SIS is the one corresponding to the linear structure of the sentence. That is, $SIS(s) = (s, E)$ where $(w_i, w_j) \in E$ iff the word $w_i$ occurs immediately before $w_j$ in the sentence (Figure 1, bottom left part).
In a linear structure $(s = \langle w_1, w_2, \ldots, w_n \rangle, E)$, where $E = \{(w_i, w_{i+1}); i = 1, \ldots, n-1\}$, we define the chain

$$C(w_j, [l, r]) = \{w_{j-l}, \ldots, w_j, \ldots, w_{j+r}\} \cap s.$$

We can now define a new set of features that makes use of the structural information. Structural features are defined using the SIS. When defining a feature, the naming of nodes in s is done relative to a distinguished node, denoted $w_p$, which we call the focus word of the feature. Regardless of the arity of the features we sometimes denote the feature f defined with respect to $w_p$ as $f(w_p)$.
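The linear structure of Example 6 and the chain around a word can be sketched as follows. The sketch assumes that the chain $C(w_j, [l, r])$ is the window from l positions before $w_j$ to r positions after it, clipped to the sentence boundaries; words are represented by their 1-indexed positions, and the function names are hypothetical.

```python
def linear_edges(n):
    """Edge set E of the linear structure: w_i immediately precedes w_{i+1}."""
    return {(i, i + 1) for i in range(1, n)}


def chain(j, l, r, n):
    """Positions in C(w_j, [l, r]) for a sentence of n words.

    Assumed reading: the window j-l .. j+r, intersected with the sentence,
    so positions before the first word or after the last word are dropped.
    """
    return [i for i in range(j - l, j + r + 1) if 1 <= i <= n]


# Sentence of Example 2 has 11 words; take the missing word w_2 as focus.
E = linear_edges(11)
window = chain(2, 2, 2, 11)  # w_0 does not exist, so the window is clipped
```

Clipping at the boundaries corresponds to the intersection with s in the chain definition, so a focus word near the start or end of the sentence simply gets a shorter chain.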
Definition 7 (Proximity) Let $SIS(s) = (s, E)$ be the linear structure and let $I \in \mathcal{I}$ be a k-ary predicate with range R. Let $w_p$ be a focus word and $C = C(w_p, [l, r])$ the chain around it. Then, the proximity features for I with respect to the chain C are defined as:

$$f_C(I(w), a) = \begin{cases} 1 & \text{if } I(w) = a, \; a \in R, \; w \in C \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
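A literal reading of relation (3) for a unary predicate can be sketched as below. It assumes the predicate is stored as a mapping from 1-indexed positions to values and that the chain is given as a list of positions; the name `proximity` and this storage format are hypothetical.

```python
# word predicate for the sentence of Example 2, as position -> value.
WORDS = dict(enumerate(
    ["John", "X", "at", "the", "clock", "to", "see", "what", "time", "it", "is"],
    start=1,
))


def proximity(values, w, a, chain_positions):
    """Relation (3): 1 if position w lies in the chain and I(w) == a, else 0.

    `values` maps positions to the value of a unary predicate I.
    """
    return 1 if w in chain_positions and values.get(w) == a else 0


# Chain around the focus word w_2 with l = 1 and r = 2, i.e. w_1 .. w_4.
C = [1, 2, 3, 4]
at_in_chain = proximity(WORDS, 3, "at", C)
clock_in_chain = proximity(WORDS, 5, "clock", C)  # w_5 is outside the chain
```

The second call evaluates to 0 even though the word at position 5 is "clock", because the feature is only defined over words inside the chain.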
The second type of feature composition defined using the structure is a collocation operator.

Definition 8 (Collocation) Let $f_1, \ldots, f_k$ be feature definitions. $colloc_C(f_1, f_2, \ldots, f_k)$ is a restricted conjunctive operator that is evaluated on a chain C of length k in a graph. Specifically, let $C = \{w_{j_1}, w_{j_2}, \ldots, w_{j_k}\}$ be a chain of length k in SIS(s). Then, the collocation feature for $f_1, \ldots, f_k$ with respect to the chain C is defined as

$$colloc_C(f_1, \ldots, f_k) = \begin{cases} 1 & \text{if } \forall i = 1, \ldots, k, \; f_i(w_{j_i}) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
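Relation (4) pairs the i-th feature with the i-th word of the chain and requires all of them to be active. A minimal sketch, assuming each feature is a function from a 1-indexed position to 0 or 1; the helper `is_word` and the function name `colloc` are hypothetical.

```python
SENTENCE = ["John", "X", "at", "the", "clock", "to", "see", "what", "time", "it", "is"]


def is_word(value):
    """Return a unary feature that is 1 iff the word at a position equals `value`."""
    return lambda pos: 1 if SENTENCE[pos - 1] == value else 0


def colloc(features, chain_positions):
    """Relation (4): 1 iff f_i is active on the i-th word of the chain, for all i."""
    if len(features) != len(chain_positions):
        raise ValueError("need exactly one feature per chain position")
    return 1 if all(f(w) == 1 for f, w in zip(features, chain_positions)) else 0


# Collocation over the chain (w_3, w_4): "at" immediately followed by "the".
at_the = colloc([is_word("at"), is_word("the")], [3, 4])
at_clock = colloc([is_word("at"), is_word("clock")], [3, 4])
```

Unlike a plain conjunction, the operator is positional: swapping the order of the features changes which word each one is tested on.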
The following example defines features that areused in the experiments described in Sec. 4.