briefly discuss an instance-based method that provides an alternative approach and baseline for our experiments.

In order to solve the sentence completion problem with an $N$-gram model, we need to find the most likely word sequence $w_{t+1},\dots,w_{t+T}$ given a word $N$-gram model and an initial sequence $w_1,\dots,w_t$ (Equation 3). Equation 4 factorizes the joint probability of the missing words; the $N$-th order Markov assumption that underlies the $N$-gram model simplifies this expression in Equation 5.
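\begin{align}
& \operatorname*{argmax}_{w_{t+1},\dots,w_{t+T}} P(w_{t+1},\dots,w_{t+T} \mid w_1,\dots,w_t) \tag{3} \\
&= \operatorname*{argmax}_{w_{t+1},\dots,w_{t+T}} \prod_{j=1}^{T} P(w_{t+j} \mid w_1,\dots,w_{t+j-1}) \tag{4} \\
&= \operatorname*{argmax}_{w_{t+1},\dots,w_{t+T}} \prod_{j=1}^{T} P(w_{t+j} \mid w_{t+j-N+1},\dots,w_{t+j-1}) \tag{5}
\end{align}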
The individual factors of Equation 5 are provided by the model. The Markov order $N$ has to balance sufficient context information against sparsity of the training data. A standard solution is to use a weighted linear mixture of $n$-gram models, $1 \le n \le N$ (Brown et al., 1992). We use an EM algorithm to select mixing weights that maximize the generation probability of a tuning set of sentences that have not been used for training.

We are left with the following questions: (a) how can we decode the most likely completion efficiently; and (b) how many words should we predict?
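The mixing weights of such a linear mixture can be estimated with the textbook EM update for interpolation weights. The following is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names `em_mixture_weights` and `log_likelihood` are hypothetical, and the input is assumed to be a precomputed table in which row $i$ holds the probability that each component $n$-gram model assigns to the $i$-th token of the tuning set.

```python
import math
from typing import List, Sequence


def log_likelihood(component_probs: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> float:
    """Log-probability of the tuning tokens under the weighted mixture."""
    total = 0.0
    for probs in component_probs:
        total += math.log(sum(w * p for w, p in zip(weights, probs)))
    return total


def em_mixture_weights(component_probs: Sequence[Sequence[float]],
                       iterations: int = 50) -> List[float]:
    """Estimate mixing weights by EM (hypothetical helper, illustration only).

    component_probs[i][n] is the probability that the n-th component model
    assigns to the i-th tuning token given its context.
    """
    k = len(component_probs[0])
    weights = [1.0 / k] * k  # start from the uniform mixture
    for _ in range(iterations):
        # E-step: accumulate each component's responsibility for each token.
        totals = [0.0] * k
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            if mix == 0.0:
                continue  # token unexplained by every component; skip it
            for n in range(k):
                totals[n] += weights[n] * probs[n] / mix
        # M-step: new weights are the normalized responsibilities.
        norm = sum(totals)
        weights = [t / norm for t in totals]
    return weights
```

Because the weights are fitted on tuning data that was not used for training, a higher-order component only receives a large weight if its estimates generalize beyond the training set.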
4.1 Efficient Prediction
We have to address the problem of finding the most likely completion, $\operatorname{argmax}_{w_{t+1},\dots,w_{t+T}} P(w_{t+1},\dots,w_{t+T} \mid w_1,\dots,w_t)$, efficiently, even though the size of the search space grows exponentially in the number of predicted words.

We will now identify the recursive structure in Equation 3; this will lead us to a Viterbi algorithm that retrieves the most likely word sequence. We first define an auxiliary variable $\delta_{t,s}(w_1,\dots,w_N \mid w_{t-N+2},\dots,w_t)$ in Equation 6; it quantifies the greatest possible probability over all arbitrary word sequences $w_{t+1},\dots,w_{t+s}$, followed by the word sequence $w_{t+s+1} = w_1,\dots,w_{t+s+N} = w_N$, conditioned on the initial word sequence $w_{t-N+2},\dots,w_t$. In Equation 7, we factorize the last transition and utilize the $N$-th order Markov assumption. In Equation 8, we split the maximization and introduce a new random variable $w_0$ for $w_{t+s}$. We can now refer to the definition of $\delta$ and see the recursion in Equation 9: $\delta_{t,s}$ depends only on $\delta_{t,s-1}$ and the $N$-gram model probability $P(w_N \mid w_1,\dots,w_{N-1})$.
\begin{align}
& \delta_{t,s}(w_1,\dots,w_N \mid w_{t-N+2},\dots,w_t) \tag{6} \\
&= \max_{w_{t+1},\dots,w_{t+s}} P(w_{t+1},\dots,w_{t+s},\, w_{t+s+1}=w_1,\dots,w_{t+s+N}=w_N \mid w_{t-N+2},\dots,w_t) \notag \\
&= \max_{w_{t+1},\dots,w_{t+s}} P(w_N \mid w_1,\dots,w_{N-1}) \tag{7} \\
&\qquad P(w_{t+1},\dots,w_{t+s},\, w_{t+s+1}=w_1,\dots,w_{t+s+N-1}=w_{N-1} \mid w_{t-N+2},\dots,w_t) \notag \\
&= \max_{w_0} \max_{w_{t+1},\dots,w_{t+s-1}} P(w_N \mid w_1,\dots,w_{N-1}) \tag{8} \\
&\qquad P(w_{t+1},\dots,w_{t+s-1},\, w_{t+s}=w_0,\dots,w_{t+s+N-1}=w_{N-1} \mid w_{t-N+2},\dots,w_t) \notag \\
&= \max_{w_0} P(w_N \mid w_1,\dots,w_{N-1})\, \delta_{t,s-1}(w_0,\dots,w_{N-1} \mid w_{t-N+2},\dots,w_t) \tag{9}
\end{align}
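As an illustration, the recursion can be specialized to a bigram model ($N = 2$), where the conditioning context $w_{t-N+2},\dots,w_t$ reduces to the single word $w_t$. The base case below is not stated in the text; it is an assumption obtained by setting $s = 0$ in Equation 6, so that the maximization ranges over an empty sequence.

```latex
\begin{align*}
\delta_{t,0}(w_1, w_2 \mid w_t) &= P(w_1 \mid w_t)\, P(w_2 \mid w_1) \\
\delta_{t,s}(w_1, w_2 \mid w_t) &= \max_{w_0} P(w_2 \mid w_1)\, \delta_{t,s-1}(w_0, w_1 \mid w_t)
\end{align*}
```

Each step therefore maximizes over a single word $w_0$ for every pair $(w_1, w_2)$, instead of enumerating all sequences $w_{t+1},\dots,w_{t+s}$.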
Exploiting the $N$-th order Markov assumption, we can now express our target probability (Equation 3) in terms of $\delta$ in Equation 10.
\begin{align}
& \max_{w_{t+1},\dots,w_{t+T}} P(w_{t+1},\dots,w_{t+T} \mid w_{t-N+2},\dots,w_t) \tag{10} \\
&= \max_{w_1,\dots,w_N} \delta_{t,T-N}(w_1,\dots,w_N \mid w_{t-N+2},\dots,w_t) \notag
\end{align}
The last $N$ words in the most likely sequence are simply the $\operatorname{argmax}_{w_1,\dots,w_N} \delta_{t,T-N}(w_1,\dots,w_N \mid w_{t-N+2},\dots,w_t)$. In order to collect the preceding most likely words, we define an auxiliary variable $\Psi$ in Equation 11 that can be determined in Equation 12. We have now found a Viterbi algorithm that is linear in $T$, the completion length.
\begin{align}
& \Psi_{t,s}(w_1,\dots,w_N \mid w_{t-N+2},\dots,w_t) \tag{11} \\
&= \operatorname*{argmax}_{w_{t+s}} \max_{w_{t+1},\dots,w_{t+s-1}} P(w_{t+1},\dots,w_{t+s},\, w_{t+s+1}=w_1,\dots,w_{t+s+N}=w_N \mid w_{t-N+2},\dots,w_t) \notag \\
&= \operatorname*{argmax}_{w_0} \delta_{t,s-1}(w_0,\dots,w_{N-1} \mid w_{t-N+2},\dots,w_t)\, P(w_N \mid w_1,\dots,w_{N-1}) \tag{12}
\end{align}
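The decoding procedure defined by $\delta$ and $\Psi$ can be sketched in a few lines. This is an illustrative variant rather than a transcription of Equations 6 to 12: it tracks the last $N-1$ words as the Viterbi state instead of $N$-tuples, stores back-pointers in the role of $\Psi$, and decodes a fixed number of words `T` without any stopping criterion. The names `viterbi_completion`, `brute_force_completion`, and the callback `prob(word, context)` are hypothetical, and the history is assumed to contain at least $N-1$ words. The exhaustive search is included only as a reference for checking the result on small vocabularies.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Context = Tuple[str, ...]
ProbFn = Callable[[str, Context], float]  # P(word | last N-1 words)


def viterbi_completion(prob: ProbFn, vocab: Sequence[str],
                       history: Sequence[str], n: int,
                       T: int) -> Tuple[float, List[str]]:
    """Most likely T-word completion; cost grows linearly in T."""
    ctx_len = n - 1
    start: Context = tuple(history[-ctx_len:]) if ctx_len > 0 else ()
    delta: Dict[Context, float] = {start: 1.0}  # best score per state
    backptrs: List[Dict[Context, Tuple[Context, str]]] = []
    for _ in range(T):
        new_delta: Dict[Context, float] = {}
        ptr: Dict[Context, Tuple[Context, str]] = {}
        for state, score in delta.items():
            for w in vocab:
                p = score * prob(w, state)
                nxt = (state + (w,))[-ctx_len:] if ctx_len > 0 else ()
                if p > new_delta.get(nxt, -1.0):
                    new_delta[nxt] = p
                    ptr[nxt] = (state, w)  # remember the best predecessor
        delta = new_delta
        backptrs.append(ptr)
    # Pick the best final state, then follow the back-pointers.
    state = max(delta, key=lambda s: delta[s])
    best = delta[state]
    words: List[str] = []
    for ptr in reversed(backptrs):
        state, w = ptr[state]
        words.append(w)
    words.reverse()
    return best, words


def brute_force_completion(prob: ProbFn, vocab: Sequence[str],
                           history: Sequence[str], n: int,
                           T: int) -> Tuple[float, List[str]]:
    """Reference search over all len(vocab)**T completions (exponential)."""
    ctx_len = n - 1
    best_p, best_words = -1.0, []
    for cand in product(vocab, repeat=T):
        seq = list(history)
        p = 1.0
        for w in cand:
            ctx = tuple(seq[-ctx_len:]) if ctx_len > 0 else ()
            p *= prob(w, ctx)
            seq.append(w)
        if p > best_p:
            best_p, best_words = p, list(cand)
    return best_p, best_words
```

Each Viterbi step visits at most $|V|^{N-1}$ states and $|V|$ successor words, which is where the linear dependence on the completion length comes from; the brute-force reference visits all $|V|^T$ candidate sequences.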
The Viterbi algorithm starts with the most recently entered word $w_t$ and moves iteratively into the future. When the $N$-th token in the highest scored $\delta$ is a period, then we can stop as our goal is only to predict (parts of) the current sentence. However, since