[…] about the decrease of confusion in the text, from which an SMT system might benefit.

After applying relaxed SMT, the resulting null tokens in the translated sentences have to be replaced by the corresponding words from the set of relaxed words. As relaxed words are chosen from the top ranked words, which have high occurrence counts in the corpus, their n-gram probabilities can be reliably estimated to serve for word prediction. Also, these words are mainly function words and, from the human perspective, function words are usually much easier to predict from their neighboring context than content words. Consider for instance the sentence "the house of the president is very nice". Function words like "the", "of", and "is" are certainly easier to predict than content words such as "house", "president", and "nice".

The rest of the paper is organized into four sections. In Section 2, we discuss the relaxation strategy implemented for an SMT system, which generates translation outputs that contain null tokens. In Section 3, we present the word prediction mechanism used to recover the null tokens occurring in the relaxed translation outputs. In Section 4, we present and discuss the experimental results. Finally, in Section 5 we present the most relevant conclusions of this work.

2 Relaxation for Machine Translation

In this paper, relaxation refers to the replacement of the most frequent words in a text by a null token. In practice, a set of frequent words is defined, and the cardinality of this set is referred to as the relaxation order. For example, let the relaxation order be two and the two top ranked words be "the" and "is". By relaxing the sample sentence previously presented in the introduction, the following relaxed sentence is obtained: "NULL house of NULL President NULL very beautiful".

From some of our preliminary experimental results with the EPPS dataset, we observed that a relaxation order of twenty led to a perplexity reduction of about 15%. To see whether this contributes to improving translation performance, we trained a translation system by relaxing the top ranked words in the vocabulary of the target language. In this way, a large number of words in the source language will be translated into a null token. For example, "la" ("the" in Spanish) and "es" ("is" in Spanish) will both be translated into a null token in English.

This relaxation of terms is only applied to the target language vocabulary, and it is conducted after the word alignment process but before the extraction of translation units and the computation of model probabilities. The main objective of this relaxation procedure is twofold: on the one hand, it attempts to reduce the complexity of the translation task by reducing the size of the target vocabulary while affecting a large proportion of the running words in the text; on the other hand, it should also help to reduce model sparseness and improve model probability estimates.

Of course, different from the Information Retrieval case, in which stop words are not used at all along the search process, in the considered machine translation scenario the removed words need to be recovered after decoding in order to produce an acceptable translation. The relaxed word replacement procedure, which is based on an n-gram predictor, is implemented as a post-processing step and applied to the relaxed machine translation output in order to produce the final translation result.

Our bet here is that the gain in the translation step, which is derived from the relaxation strategy, should be enough to compensate for the error rate of the word prediction step, producing, overall, a significant improvement in translation quality with respect to the non-relaxed baseline procedure.

The next section describes the implemented word prediction model in detail. It constitutes a fundamental element of our proposed relaxed SMT approach.

3 Frequent Word Prediction

Word prediction has been widely studied and used in several different tasks such as, for example, augmentative and alternative communication (Wandmacher and Antoine, 2007) and spelling correction (Thiele et al., 2000). In addition to the commonly used word n-gram, various language modeling techniques have been applied, such as the semantic model (Luís and Rosa, 2002; Wandmacher and Antoine, 2007) and the class-based model (Thiele et al., 2000; Zohar and Roth, 2000; Ruch et al., 2001).

The role of such a word predictor in our considered problem is to recover the null tokens in the translation output by replacing them with the words that best fit the sentence. This task is essentially a classification problem, in which the most suitable relaxed word for recovering a given null token must be selected. In other words,

\hat{w}_i = \arg\max_{w \in R} P_s(w_1 \ldots w_{i-1} \, w \, w_{i+1} \ldots w_N)

where P_s(\cdot) is the probabilistic model, e.g. an n-gram model, that estimates the likelihood of the sentence when the null token at position i is recovered with word w, drawn from the set of relaxed words R. The cardinality |R| is referred to as the relaxation order; e.g. |R| = 5 implies that the five most frequent words have been relaxed and are candidates to be recovered.
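For illustration, the relaxation step of Section 2 can be sketched in a few lines of Python. This is not the authors' implementation; the function names and toy data are ours, and a real system would apply the replacement to the target side of the training corpus between word alignment and phrase extraction, as described above.

```python
from collections import Counter

def build_relaxed_set(corpus_tokens, relaxation_order):
    """Return the relaxed word set R: the `relaxation_order` most frequent words."""
    counts = Counter(corpus_tokens)
    return {w for w, _ in counts.most_common(relaxation_order)}

def relax(sentence_tokens, relaxed_set, null_token="NULL"):
    """Replace every occurrence of a relaxed word by the null token."""
    return [null_token if w in relaxed_set else w for w in sentence_tokens]

# Toy example with relaxation order two, assuming "the" and "is" are the two
# top ranked words (as in the paper's example):
relaxed_set = {"the", "is"}
print(relax("the house of the president is very nice".split(), relaxed_set))
# -> ['NULL', 'house', 'of', 'NULL', 'president', 'NULL', 'very', 'nice']
```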
Notice that the word prediction problem in this task is quite different from other works in the literature. This is basically because the relaxed words to be predicted in this case are mainly function words. Firstly, it may not be effective to predict a function word semantically. For example, we are more certain in predicting "equity" than "for" given the occurrence of "share" in the sentence. Secondly, although class-based modeling is commonly used for prediction, its original intention is to tackle the sparseness problem, whereas our task focuses only on the most frequent words.

In this preliminary work, our prediction mechanism is based on an n-gram model. It predicts the word that yields the maximum a posteriori probability, conditioned on its predecessors. For the case of the trigram model, it can be expressed as follows:

\hat{w}_i = \arg\max_{w \in R} P(w \mid w_{i-2}, w_{i-1})    (1)

Often, there are cases in which more than one null token occurs consecutively. In such cases, predicting a null token is conditioned on the previously recovered null tokens. To prevent a prediction error from being propagated, one possibility is to consider the marginal probability (summed over the relaxed word set) over the words that were previously null tokens. For example, if w_{i-1} is a relaxed word which has been recovered from an earlier prediction, then the prediction of w_i should no longer be conditioned by w_{i-1}. This can be computed as follows:

\hat{w}_i = \arg\max_{w \in R} \sum_{w' \in R} P(w \mid w_{i-2}, w')    (2)

The traditional n-gram model, as discussed previously, can be termed the forward n-gram model, as it predicts the word ahead. Additionally, we also tested the backward n-gram to predict the word behind (i.e. on the left-hand side of the target word), which can be formulated as:

\hat{w}_i = \arg\max_{w \in R} P(w \mid w_{i+1}, w_{i+2})    (3)

and the bidirectional n-gram to predict the word in the middle, which can be formulated as follows:

\hat{w}_i = \arg\max_{w \in R} P(w \mid w_{i-1}, w_{i+1})    (4)

Notice that the backward n-gram model can be estimated from the word counts as:

P(w_i \mid w_{i+1}) = \frac{C(w_i \, w_{i+1})}{C(w_{i+1})}    (5)

or it can also be approximated from the forward n-gram model, as follows:

P(w_i \mid w_{i+1}) = \frac{P(w_{i+1} \mid w_i) \, P(w_i)}{\sum_{w} P(w_{i+1} \mid w) \, P(w)}    (6)

Similarly, the bidirectional n-gram model can be estimated from the word counts:

P(w_i \mid w_{i-1}, w_{i+1}) = \frac{C(w_{i-1} \, w_i \, w_{i+1})}{\sum_{w} C(w_{i-1} \, w \, w_{i+1})}    (7)

or approximated from the forward model:

P(w_i \mid w_{i-1}, w_{i+1}) = \frac{P(w_i \mid w_{i-1}) \, P(w_{i+1} \mid w_i)}{\sum_{w} P(w \mid w_{i-1}) \, P(w_{i+1} \mid w)}    (8)

The word prediction results obtained with the forward, backward, and bidirectional n-gram models will be presented and discussed in the experimental section.

The three n-gram models discussed so far predict words based on the local word ordering. There are two main drawbacks to this: first, only the neighboring words can be used for prediction, as building higher order n-gram models is costly; and, second, prediction may easily fail when consecutive null tokens occur, especially when all words conditioning the prediction probability are recovered null tokens. Hence, instead of predicting words by maximizing the local probability, predicting words by maximizing a global score (i.e. a sentence probability in this case) may be a better alternative.

At the sentence level, the word predictor considers all possible relaxed word permutations and searches for the one that yields the maximum a posteriori sentence probability. For the trigram model, the relaxed words that maximize the sentence probability can be predicted as follows:

\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \prod_{i=1}^{N} P(w_i \mid w_{i-2}, w_{i-1})    (9)

where N is the number of words in the sentence.
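To make the recovery procedure concrete, here is a small Python sketch of the local prediction of Eq. (1) and the sentence-level search behind Eq. (9), using a toy bigram table instead of a trained trigram model. Everything in it (the probability values, the floor constant, the function names) is an illustrative assumption of ours, not part of the paper.

```python
import itertools
import math

# Toy forward bigram table P(w | prev). In a real system these would be smoothed
# probabilities estimated from the training corpus; the numbers here are made up.
BIGRAM = {
    ("<s>", "the"): 0.4, ("<s>", "is"): 0.05,
    ("the", "house"): 0.2, ("is", "house"): 0.01,
    ("house", "of"): 0.3,
    ("of", "the"): 0.5, ("of", "is"): 0.01,
    ("the", "president"): 0.1, ("is", "president"): 0.01,
    ("president", "is"): 0.2, ("president", "the"): 0.01,
    ("is", "very"): 0.15, ("the", "very"): 0.02,
}

def bigram_prob(prev, word, floor=1e-6):
    return BIGRAM.get((prev, word), floor)

def local_predict(tokens, i, relaxed_set):
    """Recover the null token at position i as in Eq. (1), but with a bigram
    context instead of a trigram one."""
    prev = tokens[i - 1] if i > 0 else "<s>"
    return max(relaxed_set, key=lambda w: bigram_prob(prev, w))

def global_predict(tokens, relaxed_set, null="NULL"):
    """Recover all null tokens jointly, in the spirit of Eq. (9): enumerate every
    assignment of relaxed words to the null slots and keep the one with the
    highest sentence log-probability."""
    slots = [i for i, t in enumerate(tokens) if t == null]
    best, best_score = list(tokens), -math.inf
    for filling in itertools.product(relaxed_set, repeat=len(slots)):
        cand = list(tokens)
        for pos, w in zip(slots, filling):
            cand[pos] = w
        score = sum(math.log(bigram_prob(p, w))
                    for p, w in zip(["<s>"] + cand[:-1], cand))
        if score > best_score:
            best, best_score = cand, score
    return best

relaxed = {"the", "is"}
sentence = "NULL house of NULL president NULL very nice".split()
print(local_predict(sentence, 0, relaxed))   # -> 'the'
print(global_predict(sentence, relaxed))
# -> ['the', 'house', 'of', 'the', 'president', 'is', 'very', 'nice']
```

Note that enumerating all fillings grows exponentially with the number of null slots, so a practical implementation would prune or beam the search rather than enumerate exhaustively.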
Although the forward, backward, and interpolated models have been shown to be applicable for local word prediction, they make no difference at sentence-level prediction, as they produce the same sentence probability:

\prod_{i=1}^{N} P(w_i \mid w_{i-2}, w_{i-1}) = \prod_{i=1}^{N} P(w_i \mid w_{i+1}, w_{i+2})    (10)

4 Experimental Results and Discussion

In this section, we first highlight the Zipfian distribution in the corpus and the reduction of perplexity after removing the top ranked words. The estimated n-gram probabilities were then used for word prediction, and we report the resulting prediction accuracy at different relaxation orders. The performance of the SMT system with a relaxed vocabulary is presented and discussed in the last subsection of this section.

4.1 Corpus Analysis

The data used in our experiments is taken from the EPPS (Europarl) corpus. We used the version available through the shared task of the 2007 ACL Workshop on Statistical Machine Translation (Burch et al., 2007). The training set comprises 1.2M sentences with 35M words, while the development and test sets contain 2K sentences each.

From the training set, we computed the twenty most frequent words and ranked them accordingly. We found them to be mainly function words. Their counts follow closely a Zipfian distribution (Figure 1), and they account for a vast proportion of the text (Figure 2): indeed, 40% of the running words are made up by these twenty words.

[Figure 1: counts of the top ranked words in the DEV set (that, is, it, comma, we, for, to, period, of, and, in, on, i, as, the, be, are, have, this)]

[Figure 2: cumulative relative frequencies of the top ranked words (up to order 20)]

We found that when the most frequent words were relaxed from the vocabulary, i.e. replaced by a null token, the perplexity (measured with a trigram model) decreased by up to 15% for the case of a relaxation order of twenty (Figure 3). This implies that the relaxation makes the text less confusing, which might benefit natural language processing tasks such as machine translation.

[Figure 3: perplexity on the DEV and TEST sets for relaxation orders from 0 to 20]
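As far as we can tell, the perplexity figures above are standard trigram perplexities computed separately on the original and on the relaxed text, each with a model trained on the corresponding version of the corpus. A minimal sketch of such a measurement, with a hypothetical `trigram_prob` function supplied by the caller, might look like this:

```python
import math

def perplexity(sentences, trigram_prob):
    """Trigram perplexity of a tokenized text; `trigram_prob(w2, w1, w)` is
    assumed to return a smoothed probability P(w | w2, w1)."""
    log_sum, n_words = 0.0, 0
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(2, len(padded)):
            log_sum += math.log(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
            n_words += 1
    return math.exp(-log_sum / n_words)

# A 15% reduction at relaxation order twenty would then mean roughly
# perplexity(relaxed_text, relaxed_lm) ≈ 0.85 * perplexity(original_text, original_lm).
```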
4.2 Word Prediction Accuracy

In order to evaluate the quality of the different prediction strategies, we carried out some experiments on replacing null tokens with relaxed words. For this, frequent words were dropped from the text and then recovered by maximizing either the local n-gram probability or the sentence probability.
In a real application, a text may comprise up to 40% null tokens that must be recovered from the twenty top ranked words.

For n-gram level (local) prediction, we evaluated word accuracy at different orders of relaxation. More specifically, we tested the forward trigram model, the backward trigram model, the bidirectional model, and their interpolation, and computed the percentage of null tokens recovered successfully with respect to the original words. These results are shown in Figure 4. Notice that the accuracy for a relaxation order of one is 100%, so it has not been depicted in the plots.

Notice from the figure how the forward and backward models performed very alike throughout the different relaxation orders. This can be […]

[…] trigram model. These results are shown in Figure 5. From the cumulative frequencies shown in Figure 2, we can see that 40% of the words in the text can now be predicted with an accuracy of about 77%.

[Figure 5: accuracy (%) of the bigram and trigram models at different relaxation orders]
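The accuracy figures above can be read as the fraction of dropped words that the predictor restores exactly. A small helper in that spirit (our own naming, not the authors' evaluation script) could look like this:

```python
def prediction_accuracy(sentences, relaxed_set, predictor, null="NULL"):
    """Percentage of relaxed (dropped) words that `predictor` recovers correctly.
    `predictor(tokens)` must return the token list with every null slot filled."""
    correct = total = 0
    for tokens in sentences:
        relaxed = [null if w in relaxed_set else w for w in tokens]
        recovered = predictor(relaxed)
        for original, predicted in zip(tokens, recovered):
            if original in relaxed_set:
                total += 1
                correct += int(original == predicted)
    return 100.0 * correct / total if total else 0.0
```

For example, calling it with the `global_predict` sketch from above as the predictor would give one point of an accuracy-versus-relaxation-order curve of the kind the figures report.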
[…] computation of the bidirectional model is still feasible. In Figure 6 we present a scaled version of the plots focusing on the lower orders for comparison purposes. Although the global bigram prediction makes use of all words in the sentence in order to yield the prediction, locally a given word is only covered by two consecutive bigrams. Thus, the prediction of a word does not depend on the second word before it or the second word after it. In other words, we can see the bidirectional model as a global model that is applied to a "short segment" (in this case, a three-word segment). The only difference is that the local bidirectional model estimates the probabilities from the corpus and keeps all seen "short segment" probabilities in the language model (in our case, it is derived from the forward bigram), while the global bigram model optimizes the probabilities by searching for the best two consecutive bigrams. The global prediction might only show its advantage when predicting two or more consecutive null tokens.

[Figure 6: scaled view of the prediction accuracy (%) at the lower relaxation orders for the bidirectional (local), interpolation (local), and bigram (global) models]

[…] Building the bidirectional bigram/trigram model from scratch is worth considering. As the language model toolkits known to us do not offer this functionality (even the backward n-gram model is built by first reversing the sentences manually), the discounting/smoothing of the bidirectional trigram has to be derived; the methods of Good-Turing, Kneser-Ney, absolute discounting, etc. (Chen and Goodman, 1998) can be imitated.

4.3 Translation Performance

As frequent words have been ignored in the target language vocabulary of the machine translation system, the translation outputs will contain a great number of null tokens. The amount of null tokens should approximate the cumulative frequencies shown in Figure 2.

In this experiment, a word predictor was used to recover all null tokens in the translation outputs, and the recovered translations were evaluated with the BLEU metric. All BLEU scores have been computed up to trigram precision.

The word predictor used was the global trigram model, which was the best performing system in the word prediction experiments previously described. In this case, the predictor was used to recover the null tokens in the translation outputs.
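The paper does not say which BLEU implementation was used; as a sketch only, "BLEU computed up to trigram precision" can be reproduced with NLTK by restricting the n-gram weights to orders one through three:

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu_up_to_trigram(references, hypotheses):
    """references: one reference token list per sentence; hypotheses: the
    recovered translation token lists. Weights cover only 1-, 2- and 3-grams."""
    return corpus_bleu([[ref] for ref in references], hypotheses,
                       weights=(1/3, 1/3, 1/3))
```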
Notice how the BLEU scores of the baseline systems are better than those of the relaxed systems. […]

[Figure: BLEU scores on the DEV and TEST sets]

[…] the word prediction step performed consistently for the relaxed SMT systems, as for the baseline systems (for which the null tokens were inserted manually into the sentences). Since the word prediction is based on an n-gram model, we may deduce that the relaxed SMT system preserves the syntactic structure of the sentence, as the null tokens in the translation output could be recovered as accurately as in the case of the baseline system.
[…]versity and the Institute for Infocomm Research, for their support regarding the development and publishing of this work.