This document is part of the Research and Innovation Action “Quality Translation 21 (QT21)”.
This project has received funding from the European Union’s Horizon 2020 program for ICT
under grant agreement no. 645452.
Deliverable D1.5: Improved Learning for Machine Translation
For copies of reports, updates on project activities and other QT21-related information, contact:
Prof. Stephan Busemann, DFKI GmbH stephan.busemann@dfki.de
Stuhlsatzenhausweg 3 Phone: +49 (681) 85775 5286
66123 Saarbrücken, Germany Fax: +49 (681) 85775 5338
Copies of reports and other material can also be accessed via the project’s homepage:
http://www.qt21.eu/
Contents
1 Executive Summary
5 Improved BEER
References
Appendices
1 Executive Summary
This deliverable reports on the progress in Task 1.3 Improved Learning for Machine Translation,
as specified in the project proposal:
Task 1.3 Improved Learning for Machine Translation [M01–M36] (CUNI, DFKI,
RWTH, UVA) Experiments performed within this task will address the second ob-
jective, namely full structured prediction, with discriminative and integrated train-
ing. Focus will be on better correlation of the training criterion with the target
metrics, avoidance of overfitting by incorporating certain so far unused smoothing
techniques into the training process, and efficient algorithm to allow the use of all
training data in all phases of the training process.
The first focus point of Task 1.3 is constructing full structured predictions. We analyzed two
different approaches to achieve this. RWTH shows in Section 2 how to model bilingual sentence
pairs on a word level together with reordering information as one sequence – joint translation
and reordering (JTR) sequences. Paired with a phrase-based machine translation system, this gave a significant improvement of up to 2.2 Bleu points over the baseline. RWTH proposes
in Section 3 an alignment-based neural machine translation approach as an alternative to the
popular attention-based approach. They demonstrate competitive results on the IWSLT 2013
German→English and BOLT Chinese→English tasks.
Beyond systems designed within a unified framework with a single search for the best trans-
lation, we also experiment with a high-level combination of systems, which gives us the best
possible performance. In this joint effort the groups from the QT21 and the HimL project
managed to build the best system for the English→Romanian translation task of the ACL 2016
First Conference on Machine Translation (WMT 2016); see Section 4.
These systems depend on automatic ways to optimize parameters with a good training
criterion. The next focus of this Task is therefore to find a better correlation of the training
criterion with the target metrics. To this end, we organized and also participated in the Tuning
Task at WMT 2016. The organization of the task itself fits in WP4 and the details of the
task as a whole will be described in the corresponding deliverable there. In this deliverable, we
include our submissions to the Tuning Task. The Beer evaluation metric for MT was improved
by the UvA team to be both higher quality and significantly faster to compute, which makes
it competitive with Bleu for tuning; see Section 5. Another submission to the Tuning Task
was contributed by CUNI, who replaced the optimization step of the standard Mert procedure with Particle Swarm Optimization; see Section 6. Additionally, RWTH introduced CharacTER, a character-level TER, where the edit distance is calculated at the character level, while the shift edit is performed at the word level. It has not been tested for tuning yet, but shows very high correlation with human judgment, as described in Section 7.
Even with better metrics in place we still need to avoid overfitting to rare events while
learning from them. We developed multiple ways to avoid this by applying smoothing in different
variations. One approach, described by RWTH in Section 8, is the use of bag-of-words input features with individually trained decay rates for feed-forward neural networks, which gave results comparable to LSTM networks. The second approach from RWTH is based on smoothing the standard phrase translation probability by reducing the vocabulary with a word-label mapping. The experiments showed empirically that the smoothing is not significantly affected by the choice of vocabulary and is more effective for large-scale translation tasks; see Section 9.
Section 10 presents work by DFKI on smoothing using word classes for morphologically-
rich languages. These languages have large vocabularies, which increase training times and
data sparsity. The research developed multiple techniques to improve both the scalability and
quality of word clusters for word alignment.
• count-based n-gram models with modified Kneser-Ney smoothing, a well-tested technique for language modeling
• feed-forward neural networks (FFNN), a model that should generalize better to unseen events
• recurrent neural networks (RNN), which generalize well and can in principle take the complete history into account
Comparisons between the count-based JTR model and the operation sequence model (OSM; Durrani et al., 2013), both used in phrase-based decoding, showed that the JTR model performed at least as well as the OSM, with a slight advantage for JTR. In comparison to the OSM, the JTR model operates on words, leading to a smaller vocabulary size. Moreover, it utilizes simpler reordering structures without gaps and only requires one log-linear feature to be tuned, whereas the OSM needs five. The strongest combination of count and neural network models yields an improvement over the phrase-based system of up to 2.2 Bleu points on the German→English IWSLT task. This combination also outperforms the OSM by up to 1.2 Bleu points on the BOLT Chinese→English tasks.
This work appeared as a long paper at EMNLP 2015 (Guta et al., 2015) and is available in
Appendix A.
5 Improved BEER
Beer, introduced by UvA, is a trained evaluation metric with a linear model that combines features capturing character n-grams and permutation trees. This year the Beer learning algorithm has been improved (a linear SVM instead of logistic regression) and some features that are relatively slow to compute were removed (paraphrasing, syntax and permutation trees),
which resulted in a very large speed-up. This speed-up was essential for fast tuning of MT systems: tuning with Beer is now as fast as tuning with Bleu. For some languages, the implementation is now even more accurate. An additional change in Beer is that the usual training for ranking is replaced by a compromise: the initial model is trained for relative ranking (RR) with a ranking SVM, and the SVM output is then scaled using a trained regression model to approximate absolute direct assessment (DA) judgments.
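The two-stage training just described (a ranking SVM fitted on pairwise preferences, whose raw score is then mapped to an absolute scale by a regression model) can be sketched as follows. This is only an illustrative outline on synthetic data with simplified features; it is not the actual Beer implementation, and the feature set, sizes and estimator choices are assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LinearRegression

    # Toy sentence-level feature vectors (standing in for char n-gram features).
    rng = np.random.default_rng(0)
    feats = rng.random((200, 8))
    human_da = feats @ rng.random(8) + 0.1 * rng.standard_normal(200)  # absolute judgments

    # Stage 1: ranking SVM via pairwise difference vectors.
    # For each pair (a, b) the classifier learns sign(score(a) - score(b)).
    pairs = rng.integers(0, 200, size=(500, 2))
    x_pair = feats[pairs[:, 0]] - feats[pairs[:, 1]]
    y_pair = (human_da[pairs[:, 0]] > human_da[pairs[:, 1]]).astype(int)
    ranker = LinearSVC(C=1.0).fit(x_pair, y_pair)

    # Stage 2: scale the raw SVM score to approximate absolute (DA) judgments.
    svm_score = feats @ ranker.coef_.ravel()
    scaler = LinearRegression().fit(svm_score.reshape(-1, 1), human_da)
    beer_like_score = scaler.predict(svm_score.reshape(-1, 1))
    print(beer_like_score[:3])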
This work on Beer appeared at WMT 2015 (Stanojević and Sima’an, 2015) and is available
in Appendix D.
adjust a hypothesis, until it completely matches the reference, normalized by the length of the
hypothesis sentence. CharacTER calculates the character level edit distance while performing
the shift edit on word level. Unlike the strict matching criterion in TER, a hypothesis word is
considered to match a reference word and could be shifted, if the edit distance between them
is below a threshold value. The Levenshtein distance between the reference and the shifted
hypothesis sequence is computed at a character level. Also, the lengths of hypothesis sequences,
rather than reference sequences, are used for normalizing the edit distance, which effectively
counters the issue that shorter translations normally achieve lower TER.
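A minimal sketch of the normalization idea follows: a character-level edit distance divided by the hypothesis length rather than the reference length. The word-level shift search and the shift threshold of the full metric are omitted here, and the function names are illustrative, not those of the released CharacTER tool.

    def char_edit_distance(hyp, ref):
        # Standard Levenshtein distance computed over characters (single-row DP).
        d = list(range(len(ref) + 1))
        for i, hc in enumerate(hyp, 1):
            prev, d[0] = d[0], i
            for j, rc in enumerate(ref, 1):
                cur = min(d[j] + 1,            # deletion
                          d[j - 1] + 1,        # insertion
                          prev + (hc != rc))   # substitution
                prev, d[j] = d[j], cur
        return d[len(ref)]

    def character_ter(hyp_words, ref_words):
        # Character-level edit distance normalized by the hypothesis length,
        # so that overly short hypotheses are not rewarded.
        hyp = " ".join(hyp_words)
        ref = " ".join(ref_words)
        return char_edit_distance(hyp, ref) / max(len(hyp), 1)

    print(character_ter("the cat sat".split(), "the cat sits".split()))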
The experimental results and evaluations showed that CharacTER achieves high correlation with human judgment on the system level, especially for morphologically rich languages, which benefit from the character-level information. It outperforms other strong metrics for translation directions out of English, while the concept remains simple and straightforward.
This work was done in cooperation with WP2 and will appear at WMT 2016 (Wang et al.,
2016) and is available in Appendix F.
References
Tamer Alkhouli, Gabriel Bretschner, Jan-Thorsten Peter, Mohammed Hethnawi, Andreas Guta,
and Hermann Ney. 2016. Alignment-based neural machine translation. In ACL 2016 First
Conference on Machine Translation. Berlin, Germany.
Jon Dehdari, Liling Tan, and Josef van Genabith. 2016. BIRA: Improved predictive exchange
word clustering. In Proceedings of the 2016 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies (NAACL).
Association for Computational Linguistics, San Diego, CA, USA, pages 1169–1174.
Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can
Markov models over minimal translation units help phrase-based SMT? In Proceedings of
the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers). Sofia, Bulgaria, pages 399–405.
Russ C Eberhart, James Kennedy, et al. 1995. A new optimizer using particle swarm theory.
In Proceedings of the sixth international symposium on micro machine and human science.
New York, NY, volume 1, pages 39–43.
Andreas Guta, Tamer Alkhouli, Jan-Thorsten Peter, Joern Wuebker, and Hermann Ney. 2015.
A comparison between count and neural network models based on joint translation and re-
ordering sequences. In Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing. Lisbon, Portugal, pages 1401–1411.
Yunsu Kim, Andreas Guta, Joern Wuebker, and Hermann Ney. 2016. A comparative study on
vocabulary reduction for phrase table smoothing. In ACL 2016 First Conference on Machine
Translation. Berlin, Germany.
Viktor Kocur and Ondřej Bojar. 2016. Particle Swarm Optimization Submission for WMT16
Tuning Task. In Proceedings of the First Conference on Machine Translation, Volume 2:
Shared Task Papers. Association for Computational Linguistics, Berlin, Germany, pages 515–
521.
Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In
Proc. of the Association for Computational Linguistics. Sapporo, Japan.
Jan-Thorsten Peter, Tamer Alkhouli, Hermann Ney, Matthias Huck, Fabienne Braune, Alexan-
der Fraser, Aleš Tamchyna, Ondřej Bojar, Barry Haddow, Rico Sennrich, Frédéric Blain, Lu-
cia Specia, Jan Niehues, Alex Waibel, Alexandre Allauzen, Lauriane Aufrant, Franck Burlot,
Elena Knyazeva, Thomas Lavergne, François Yvon, Stella Frank, and Marcis Pinnis. 2016a.
The QT21/HimL combined machine translation system. In ACL 2016 First Conference on
Machine Translation. Berlin, Germany.
Jan-Thorsten Peter, Weiyue Wang, and Hermann Ney. 2016b. Exponentially decaying bag-of-
words input features for feed-forward neural network in statistical machine translation. In
Annual Meeting of the Assoc. for Computational Linguistics. Berlin, Germany.
Miloš Stanojević and Khalil Sima’an. 2015. Evaluating MT systems with BEER. The Prague
Bulletin of Mathematical Linguistics 104:17–26.
Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTER:
Translation edit rate on character level. In ACL 2016 First Conference on Machine Transla-
tion. Berlin, Germany.
[…] neural networks (NNs), as they have been successfully applied to machine translation recently (Sundermeyer et al., 2014; Devlin et al., 2014). They are able to score any word combination without requiring additional smoothing techniques. We experiment with feed-forward and recurrent translation networks, benefiting from their smoothing capabilities. To this end, we split the linear sequence into two sequences for the neural translation models to operate on. This is possible due to the simplicity of the JTR sequence. We show that the count and NN models perform well on their own, and that combining them yields even better results.

In this work, we apply n-gram models with modified Kneser-Ney smoothing during phrase-based decoding and neural JTR models in rescoring. However, using a phrase-based system is not required by the model, but is only the initial step to demonstrate the strength of JTR models, which can be applied independently of the underlying decoding framework. While the focus of this work is on the development and comparison of the models, the long-term goal is to decode using JTR models without the limitations introduced by phrases, in order to exploit the full potential of JTR models. The JTR models are estimated on word alignments, which we obtain using GIZA++ in this paper. The future aim is to also generate improved word alignments by a joint optimization of both the alignments and the models, similar to the training of IBM models (Brown et al., 1990; Brown et al., 1993). In the long run, we intend to achieve a consistency between decoding and training using the introduced JTR models.

2 Previous Work

In order to address the downsides of the phrase translation model, various approaches have been taken. Mariño et al. (2006) proposed a bilingual language model (BILM) that operates on bilingual n-grams, with an own n-gram decoder requiring monotone alignments. The lexical reordering model introduced in (Tillmann, 2004) was integrated into phrase-based decoding. Crego and Yvon (2010) adapted the approach to BILMs. The bilingual n-grams are further advanced in (Niehues et al., 2011), where they operate on non-monotone alignments within a phrase-based translation framework. Compared to our JTR models, their BILMs treat jointly aligned source words as minimal translation units, ignore unaligned source words and do not include reordering information. Durrani et al. (2011) developed the OSM, which combined dependencies on bilingual word pairs and reordering information into a single framework. It used an own decoder that was based on n-grams of MTUs and predicted single translation or reordering operations. This was further advanced in (Durrani et al., 2013a) by a decoder that was capable of predicting whole sequences of MTUs, similar to a phrase-based decoder. In (Durrani et al., 2013b), a slightly enhanced version of OSM was integrated into the log-linear framework of the Moses system (Koehn et al., 2007). Both the BILM (Stewart et al., 2014) and the OSM (Durrani et al., 2014) can be smoothed using word classes. Guta et al. (2015) introduced the extended translation model (ETM), which operates on the word level and augments the IBM models by an additional bilingual word pair and a reordering operation. It is implemented into the log-linear framework of a phrase-based decoder and shown to be competitive with a 7-gram OSM.

The JTR n-gram models proposed within this work can be seen as an extension of the ETM. Nevertheless, JTR models utilize linear sequences of dependencies and combine the translation of bilingual word pairs and reorderings into a single model. The ETM, however, features separate models for the translation of individual words and reorderings and provides an explicit treatment of multiple alignments. As they operate on linear sequences, JTR count models can be implemented using existing toolkits for n-gram language models, e.g. the KenLM toolkit (Heafield et al., 2013).

An HMM approach for word-to-phrase alignments was presented in (Deng and Byrne, 2005), showing performance similar to IBM Model 4 on the task of bitext alignment. Feng et al. (2013) propose several models which rely only on the information provided by the source side and predict reorderings. Contrastingly, JTR models incorporate target information as well and predict both translations and reorderings jointly in a single framework.

Zhang et al. (2013) explore different Markov chain orderings for an n-gram model on MTUs in rescoring. Feng and Cohn (2013) present another generative word-based Markov chain translation model which exploits a hierarchical Pitman-Yor process for smoothing, but it is only applied to induce word alignments. Their follow-up work (Feng et al., 2014) introduces a Markov model on
MTUs, similar to the OSM described above.

Recently, neural machine translation has emerged as an alternative to phrase-based decoding, where NNs are used as standalone models to decode source input. In (Sutskever et al., 2014), a recurrent NN was used to encode a source sequence, and output a target sentence once the source sentence was fully encoded in the network. The network did not have any explicit treatment of alignments. Bahdanau et al. (2015) introduced soft alignments as part of the network architecture. In this work, we make use of hard alignments instead, where we encode the alignments in the source and target sequences, requiring no modifications of existing feed-forward and recurrent NN architectures. Our feed-forward models are based on the architectures proposed in (Devlin et al., 2014), while the recurrent models are based on (Sundermeyer et al., 2014). Further recent research on applying NN models for extended context was carried out in (Le et al., 2012; Auli et al., 2013; Hu et al., 2014). All of these works focus on lexical context and ignore the reordering aspect covered in our work.

3 JTR Sequences

The core idea of this work is the interpretation of a bilingual sentence pair and its word alignment as a linear sequence of K joint translation and reordering (JTR) tokens g_1^K. Formally, the sequence g_1^K(f_1^J, e_1^I, b_1^I) is a uniquely defined interpretation of a given source sentence f_1^J, its translation e_1^I and the inverted alignment b_1^I, where b_i denotes the ordered sequence of source positions j aligned to target position i. We drop the explicit mention of (f_1^J, e_1^I, b_1^I) to allow for a better readability. Each JTR token is either an aligned bilingual word pair ⟨f, e⟩ or a reordering class Δ_{j′j}.

Unaligned words on the source and target side are processed as if they were aligned to the empty word ε. Hence, an unaligned source word f generates the token ⟨f, ε⟩, and an unaligned target word e the token ⟨ε, e⟩.

Each word of the source and target sentences is to appear in the corresponding JTR sequence exactly once. For multiply-aligned target words e, the first source word f that is aligned to e generates the token ⟨f, e⟩. All other source words f′, that are also aligned to e, are processed as if they were aligned to the artificial word σ. Thus, each of these f′ generates a token ⟨f′, σ⟩. The same approach is applied to multiply-aligned source words. Similar to Feng and Cohn (2013), we classify the reordered source positions j′ and j by Δ_{j′j}:

    Δ_{j′j} = step backward (←)    if j = j′ − 1
              jump forward (↷)     if j > j′ + 1
              jump backward (↶)    if j < j′ − 1

The reordering classes are illustrated in Figure 1.

Algorithm 1: JTR Conversion Algorithm

    procedure JTRCONVERSION(f_1^J, e_1^I, b_1^I)
        g_1^K ← ∅
        j′ ← 0                                   // last translated source position
        for i ← 1 to I do
            if e_i is unaligned then
                APPEND(g_1^K, ⟨ε, e_i⟩)           // align e_i to the empty word ε
                continue
            // e_i is aligned to at least one source word
            j ← first source position in b_i
            if j = j′ then
                APPEND(g_1^K, ⟨σ, e_i⟩)           // e_i is aligned to the same f_j as e_{i−1}
                continue
            if j ≠ j′ + 1 then                    // alignment step is non-monotone
                REORDERINGS(f_1^J, b_1^I, g_1^K, j′, j)
            APPEND(g_1^K, ⟨f_j, e_i⟩)             // 1-to-1 translation: f_j is aligned to e_i
            j′ ← j
            for all remaining j in b_i do         // other f_j also aligned to the current e_i
                APPEND(g_1^K, ⟨f_j, σ⟩)
                j′ ← j
        if j′ ≠ J then                            // last alignment step is non-monotone
            REORDERINGS(f_1^J, b_1^I, g_1^K, j′, J + 1)
        return g_1^K

    procedure REORDERINGS(f_1^J, b_1^I, g_1^K, j′, j)   // called when a reordering class is appended
        if f_{j−1} is unaligned then
            f_{j_0}^{j−1} ← unaligned predecessors of f_j
            if j_0 ≠ j′ + 1 then                  // step to the first unaligned predecessor is non-monotone
                APPEND(g_1^K, Δ_{j′,j_0})
            for f ← f_{j_0} to f_{j−1} do          // translate unaligned predecessors by ε
                APPEND(g_1^K, ⟨f, ε⟩)
        else
            APPEND(g_1^K, Δ_{j′,j})               // non-monotone: add reordering class
Figure 1: Illustration of the reordering classes: (a) step backward (←), (b) jump forward (↷), (c) jump backward (↶).
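A compact Python rendering of the conversion described above: unaligned target words map to the empty word ε, secondary alignments to the artificial word σ, and non-monotone alignment steps emit a reordering class. It is an illustrative simplification of Algorithm 1 that skips the treatment of unaligned source predecessors in REORDERINGS; the token spellings and the example alignment are assumptions.

    EPS, SIGMA = "<eps>", "<sigma>"

    def jump_class(j_prev, j):
        # Classify the reordering step from source position j_prev to j.
        if j == j_prev - 1:
            return "<step_back>"
        return "<jump_fwd>" if j > j_prev + 1 else "<jump_back>"

    def jtr_sequence(src, tgt, b):
        # src, tgt: lists of words; b[i]: sorted 1-based source positions aligned to target i.
        g, j_prev = [], 0
        for i, e in enumerate(tgt):
            if not b[i]:                        # unaligned target word
                g.append((EPS, e))
                continue
            j = b[i][0]
            if j == j_prev:                     # same source word as the previous target word
                g.append((SIGMA, e))
                continue
            if j != j_prev + 1:                 # non-monotone alignment step
                g.append(jump_class(j_prev, j))
            g.append((src[j - 1], e))           # 1-to-1 translation
            j_prev = j
            for j2 in b[i][1:]:                 # remaining source words aligned to e
                g.append((src[j2 - 1], SIGMA))
                j_prev = j2
        if j_prev != len(src):                  # final step to the sentence end
            g.append(jump_class(j_prev, len(src) + 1))
        return g

    # wir kommen später noch auf diese Leute zurück . -> we come back to these people later .
    src = "wir kommen später noch auf diese Leute zurück .".split()
    tgt = "we come back to these people later .".split()
    b = [[1], [2], [8], [5], [6], [7], [3], [9]]   # illustrative 1-based alignment
    print(jtr_sequence(src, tgt, b))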
Due to the design of the JTR sequence, producing the source and target JTR sequences is straightforward. The resulting sequences can then be used with existing NN architectures, without further modifications to the design of the networks. This results in powerful models that require little effort to implement.

4.1 Feed-forward Neural JTR

First, we will apply a feed-forward NN (FFNN) to the JTR sequence. FFNN models resemble count-based models in using a predefined limited context size, but they do not encounter the same smoothing problems. In this work, we use a FFNN similar to that proposed in (Devlin et al., 2014), defined as:

    p(t_1^K | s_1^K) ≈ ∏_{k=1}^{K} p(t_k | t_{k−n}^{k−1}, s_{k−n}^{k})    (4)

It scores the JTR target word t_k at position k using the current source word s_k, and the history of n JTR source words. In addition, the n JTR target words preceding t_k are used as context. The FFNN computes the score by looking up the vector embeddings of the source and target context words, concatenating them, then evaluating the rest of the network. We reduce the output layer to a short-list of the most frequent words, and compute word class probabilities for the remaining words.

4.2 Recurrent Neural JTR

Unlike feed-forward NNs, recurrent NNs (RNNs) enable the use of unbounded context. Following (Sundermeyer et al., 2014), we use bidirectional recurrent NNs (BRNNs) to capture the full JTR source side. The BRNN uses the JTR target side as well as the full JTR source side as context, and it is given by:

    p(t_1^K | s_1^K) = ∏_{k=1}^{K} p(t_k | t_1^{k−1}, s_1^K)    (5)

[…] layer with the target input t_{k−1} as well, that is, we aggregate the embeddings of the input source word s_k and the input target word t_{k−1} before they are fed into the forward layer. Due to recurrency, the forward layer encodes the parts (t_1^{k−1}, s_1^k), and the backward layer encodes s_k^K, and together they encode (t_1^{k−1}, s_1^K), which is used to score the output target word t_k. For the sake of comparison to FFNN and count models, we also experiment with a recurrent model that does not include future source information; this is obtained by replacing the term s_1^K with s_1^k in Eq. 5. It will be referred to as the unidirectional recurrent neural network (URNN) model in the experiments.

Note that the JTR source and target sides include jump information; therefore, the RNN model described above explicitly models reordering. In contrast, the models proposed in (Sundermeyer et al., 2014) do not include any jumps, and hence do not provide an explicit way of including word reordering. In addition, the JTR RNN models do not require the use of IBM-1 lexica to resolve multiply-aligned words. As discussed in Section 3, these cases are resolved by aligning the multiply-aligned word to the first word on the opposite side.

The integration of the NNs into the decoder is not trivial, due to the dependence on the target context. In the case of RNNs, the context is unbounded, which would affect state recombination, and lead to less variety in the beam used to prune the search space. Therefore, the RNN scores are computed using approximations instead (Auli et al., 2013; Alkhouli et al., 2015). In (Alkhouli et al., 2015), it is shown that approximate RNN integration into the phrase-based decoder has a slight advantage over n-best rescoring. Therefore, we apply RNNs in rescoring in this work, and to allow for a direct comparison between FFNNs and RNNs, we apply FFNNs in rescoring as well.
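The feed-forward scoring in Eq. (4) can be sketched with a toy numpy forward pass: the embeddings of the n preceding JTR target tokens and of the current plus n preceding JTR source tokens are concatenated and passed through one hidden layer and a softmax. The sizes, the tiny vocabulary and the random weights are placeholders; the real model uses two hidden layers and a class-factored short-list output.

    import numpy as np

    rng = np.random.default_rng(1)
    vocab = {"<bos>": 0, "wir": 1, "kommen": 2, "später": 3,
             "we": 4, "come": 5, "later": 6, "<jump_fwd>": 7}
    V, E, H, n = len(vocab), 16, 32, 2            # vocabulary, embedding, hidden size, history

    emb = 0.1 * rng.standard_normal((V, E))       # shared token embeddings
    W1 = 0.1 * rng.standard_normal(((2 * n + 1) * E, H))
    W2 = 0.1 * rng.standard_normal((H, V))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def ffnn_prob(tgt_history, src_window, next_tgt):
        # p(t_k | t_{k-n}^{k-1}, s_{k-n}^{k}): concatenate the embeddings of the
        # n target history tokens and n+1 source tokens, evaluate the network,
        # and read off the probability of the next JTR target token.
        ctx = tgt_history + src_window            # 2n+1 context tokens
        x = np.concatenate([emb[vocab[w]] for w in ctx])
        h = np.tanh(x @ W1)
        return softmax(h @ W2)[vocab[next_tgt]]

    p = ffnn_prob(["<bos>", "we"], ["<bos>", "wir", "kommen"], "come")
    print(p)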
Table 2: Statistics for the bilingual training data of the IWSLT 2013 German→English, WMT 2015 German→English, and the DARPA BOLT Chinese→English translation tasks.

[…] (Och and Ney, 2003). We use a standard phrase-based translation system (Koehn et al., 2003). The decoding process is implemented as a beam search. All baselines contain phrasal and lexical smoothing models for both directions, word and phrase penalties, a distance-based reordering model, enhanced low frequency features (Chen et al., 2011), a hierarchical reordering model (HRM) (Galley and Manning, 2008), a word class LM (Wuebker et al., 2013) and an n-gram LM. The lexical and phrase translation models of all baseline systems are trained on all provided bilingual data. The log-linear feature weights are tuned with minimum error rate training (MERT) (Och, 2003) on Bleu (Papineni et al., 2001). All systems are evaluated with MultEval (Clark et al., 2011). The reported Bleu scores are averaged over three MERT optimization runs.

All LMs, OSMs and count-based JTR models are estimated with the KenLM toolkit (Heafield et al., 2013). The OSM and the count-based JTR model are implemented in the phrasal decoder. NNs are used only in rescoring. The 9-gram FFNNs are trained with two hidden layers. The short lists contain the 10k most frequent words, and all remaining words are clustered into 1000 word classes. The projection layer has 17 × 100 nodes, the first hidden layer 1000 and the second 500. The RNNs have LSTM architectures. The URNN has 2 hidden layers while the BRNN has one forward, one backward and one additional hidden layer. All layers have 200 nodes, while the output layer is class-factored using 2000 classes. For the count-based JTR model and OSM we tuned the n-gram size on the tuning set of each task. For the full data, 7-grams were used for the IWSLT and WMT tasks, and 8-grams for BOLT. When using in-domain data, smaller n-gram sizes were used. All rescoring experiments used 1000-best lists without duplicates.

5.1 Tasks description

The domain of IWSLT consists of lecture-type talks presented at TED conferences, which are also available online (http://www.ted.com/). All systems are optimized on the dev2010 corpus, named dev here. Some of the OSM and JTR systems are trained on the TED portions of the data containing 138K sentences. To estimate the 4-gram LM, we additionally make use of parts of the Shuffled News, LDC English Gigaword and 10^9-French-English corpora, selected by a cross-entropy difference criterion (Moore and Lewis, 2010). In total, 1.7 billion running words are taken for LM training. The BOLT Chinese→English task is evaluated on the "discussion forum" domain. The 5-gram LM is trained on 2.9 billion running words in total. The in-domain data consists of a subset of 67.8K sentences and we used a set of 1845 sentences for tuning. The evaluation set test1 contains 1844 and test2 1124 sentences. For the WMT task, we used the target side of the bilingual data and all monolingual data to train a pruned 5-gram LM on a total of 4.4 billion running words. We concatenated the newstest2011 and newstest2012 corpora for tuning the systems.

5.2 Results

We start with the IWSLT 2013 German→English task, where we compare between the different JTR and OSM models. The results are shown in Table 3. When comparing the in-domain n-gram JTR model trained using Kneser-Ney smoothing (KN) to OSM, we observe that the n-gram KN JTR model improves the baseline by 1.4 Bleu on both test and eval11. The OSM model performs similarly, with a slight disadvantage on eval11. In comparison, the FFNN of Eq. (4) improves the baseline by 0.7–0.9 Bleu, compared to the slightly better 0.8–1.1 Bleu achieved by the URNN. The difference between the FFNN and the
URNN is that the latter captures the unbounded source and target history that extends until the beginning of the sentences, giving it an advantage over the FFNN. The performance of the URNN can be improved by including the future part of the source sentence, as described in Eq. (5), resulting in the BRNN model. Next, we explore whether the models are additive. When rescoring the n-gram KN JTR output with the BRNN, an additional improvement of 0.6 Bleu is obtained. There are two reasons for this: The BRNN includes the future part of the source input when scoring target words. This information is not used by the KN model. Moreover, the BRNN is able to score word combinations unseen in training, while the KN model uses backing off to score unseen events.

When training the KN, FFNN, and OSM models on the full data, we observe less gains in comparison to in-domain data training. However, combining the KN models trained on in-domain and full data gives additional gains, which suggests that although the in-domain model is more adapted to the task, it still can gain from out-of-domain data. Adding the FFNN on top improves the combination. Note here that the FFNN sees the same information as the KN model, but the difference is that the NN operates on the word level rather than the word-pair level. Second, the FFNN is able to handle unseen sequences by design, without the need for the backing-off workaround. The BRNN improves the combination more than the FFNN, as the model captures an unbounded source and target history in addition to an unbounded future source context. Combining the KN, FFNN and BRNN JTR models leads to an overall gain of 2.2 Bleu on both dev and test.

Next, we present the BOLT Chinese→English results, shown in Table 4. Comparing n-gram KN JTR and OSM trained on the in-domain data shows they perform equally well on test1, improving the baseline by 0.7 Bleu, with a slight advantage for the JTR model on test2. The feed-forward and the recurrent in-domain networks yield the same results in comparison to each other. Training the OSM and JTR models on the full data yields slightly worse results than in-domain training. However, combining the two types of training improves the results. This is shown when adding the in-domain KN JTR model on top of the model trained on full data, improving it by up to 0.4 Bleu. Rescoring with the feed-forward and the recurrent network improves this even further, supporting the previous observation that the n-gram KN JTR and NNs complement each other. The combination of the 4 models yields an overall improvement of 1.2–1.4 Bleu.

Finally, we compare KN JTR and OSM models on the WMT German→English task in Table 5. The two models perform almost similar to each other. The JTR model improves the baseline by up to 0.7 Bleu. Rescoring the KN JTR with the FFNN improves it by up to 0.3 Bleu leading to an overall improvement between 0.5 and 1.0 Bleu.

Table 3: Results measured in Bleu for the IWSLT German→English task.

    system     data   dev    test   eval11
    baseline   full   33.3   30.8   35.7
    +OSM       TED    34.5   32.2   36.8
    +FFNN      TED    34.0   31.7   36.4
    +URNN      TED    34.2   31.9   36.5
    +BRNN      TED    34.4   32.1   36.8
    +KN        TED    34.6   32.2   37.1
     +BRNN     TED    35.0   32.8   37.7
    +OSM       full   34.1   31.6   36.5
    +FFNN      full   33.9   31.5   36.0
    +KN        full   34.2   31.6   36.6
     +KN       TED    34.9   32.4   37.1
      +FFNN    TED    35.2   32.7   37.2
      +FFNN    full   35.1   32.7   37.2
      +BRNN    TED    35.5   33.0   37.4
      +BRNN    TED    35.4   33.0   37.3

Table 4: Results measured in Bleu for the BOLT Chinese→English task.

    system     train data   test1   test2
    baseline                18.1    17.0
    +OSM       indomain     18.8    17.2
    +FFNN      indomain     18.6    17.6
    +BRNN      indomain     18.6    17.6
    +KN        indomain     18.8    17.5
    +OSM       full         18.5    17.2
    +FFNN      full         18.4    17.4
    +KN        full         18.8    17.3
     +KN       indomain     19.0    17.7
      +FFNN    full         19.2    18.3
      +RNN     indomain     19.3    18.4
Table 5: Results measured in Bleu for the WMT German→English task.

    system     newstest2013   newstest2014   newstest2015
    baseline   28.1           28.6           29.4
    +OSM       28.6           28.9           30.0
    +FFNN      28.7           28.9           29.7
    +KN        28.8           28.9           29.9
     +FFNN     29.1           29.1           30.0

5.3 Analysis

To investigate the effect of including jump information in the JTR sequence, we trained a BRNN using jump classes and another excluding them. The BRNNs were used in rescoring. Below, we demonstrate the difference between the systems:

    source:    wir kommen später noch auf diese Leute zurück .
    reference: We'll come back to these people later .

    Hypothesis 1:
    JTR source: wir kommen δ zurück δ später noch auf diese Leute δ .
    JTR target: we come ↷ back ↶ later σ to these people ↷ .

    Hypothesis 2:
    JTR source: wir kommen später noch auf diese Leute zurück .
    JTR target: we come later σ on these guys back .

Note the German verb "zurückkommen", which is split into "kommen" and "zurück". German places "kommen" at the second position and "zurück" towards the end of the sentence. Unlike German, the corresponding English phrase "come back" has the words adjacent to each other. We found that the system including jumps prefers the correct translation of the verb, as shown in Hypothesis 1 above. The system translates "kommen" to "come", jumps forward to "zurück", translates it to "back", then jumps back to continue translating the word "später". In contrast, the system that excludes jump classes is blind to this separation of words. It favors Hypothesis 2, which is a strictly monotone translation of the German sentence. This is also reflected by the Bleu scores, where we found the system including jump classes outperforming the one without by up to 0.8 Bleu.

6 Conclusion

We introduced a method that converts bilingual sentence pairs and their word alignments into joint translation and reordering (JTR) sequences. They combine interdepending lexical and alignment dependencies into a single framework. A main advantage of JTR sequences is that a variety of models can be trained on them. Here, we have estimated n-gram models with modified Kneser-Ney smoothing, FFNN and RNN architectures on JTR sequences.

We compared our count-based JTR model to the OSM, both used in phrase-based decoding, and showed that the JTR model performed at least as well as the OSM, with a slight advantage for JTR. In comparison to the OSM, the JTR model operates on words, leading to a smaller vocabulary size. Moreover, it utilizes simpler reordering structures without gaps and only requires one log-linear feature to be tuned, whereas the OSM needs 5. Due to the flexibility of JTR sequences, we can apply them also to FFNNs and RNNs. Utilizing two count models and applying both networks in rescoring gains the overall highest improvement over the phrase-based system by up to 2.2 Bleu, on the German→English IWSLT task. The combination outperforms OSM by up to 1.2 Bleu on the BOLT Chinese→English tasks.

The JTR models are not dependent on the phrase-based framework, and one of the long-term goals is to perform standalone decoding with the JTR models independently of phrase-based systems. Without the limitations introduced by phrases, we believe that JTR models could perform even better. In addition, we aim to use JTR models to obtain the alignment, which would then be used to train the JTR models in an iterative manner, achieving consistency and hoping for improved models.

Acknowledgements

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21). This material is partially based upon work supported by the DARPA BOLT project under Contract No. HR0011-12-C-0015. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, San Diego, California, USA, May.

Peter F. Brown, John Cocke, Stephan A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Rossin. 1990. A Statistical Approach to Machine Translation. Computational Linguistics, 16(2):79–85, June.

Peter F. Brown, Stephan A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311, June.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In International Workshop on Spoken Language Translation, pages 2–11, Lake Tahoe, CA, USA, December.

Stanley F. Chen and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, Cambridge, MA, August.

Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. 2011. Unpacking and transforming feature functions: New ways to smooth phrase tables. In MT Summit XIII, pages 269–275, Xiamen, China, September.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In 49th Annual Meeting of the Association for Computational Linguistics: Short Papers, pages 176–181, Portland, Oregon, June.

Josep Maria Crego and François Yvon. 2010. Improving reordering with linguistically informed bilingual n-grams. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010: Posters), pages 197–205, Beijing, China.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A joint sequence translation model with integrated reordering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1045–1054, Portland, Oregon, USA, June.

Nadir Durrani, Alexander Fraser, and Helmut Schmid. 2013a. Model with minimal translation units, but decode with phrases. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1–11, Atlanta, Georgia, June.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013b. Can Markov models over minimal translation units help phrase-based SMT? In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 399–405, Sofia, Bulgaria, August.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the usefulness of generalized word representations in SMT. In COLING, Dublin, Ireland, August.

Yang Feng and Trevor Cohn. 2013. A Markov model of machine translation using non-parametric Bayesian inference. In 51st Annual Meeting of the Association for Computational Linguistics, pages 333–342, Sofia, Bulgaria, August.

Minwei Feng, Jan-Thorsten Peter, and Hermann Ney. 2013. Advancements in reordering models for statistical machine translation. In Annual Meeting of the Assoc. for Computational Linguistics, pages 322–332, Sofia, Bulgaria, August.

Yang Feng, Trevor Cohn, and Xinkai Du. 2014. Factored Markov translation with robust modeling. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 151–159, Ann Arbor, Michigan, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 848–856, Stroudsburg, PA, USA. Association for Computational Linguistics.

Andreas Guta, Joern Wuebker, Miguel Graça, Yunsu Kim, and Hermann Ney. 2015. Extended translation models in phrase-based decoding. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, September.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria, August.

Yuening Hu, Michael Auli, Qin Gao, and Jianfeng Gao. 2014. Minimum translation modeling with recurrent neural networks. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 20–29, Gothenburg, Sweden, April.

P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of the 2003 Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-03), pages 127–133, Edmonton, Alberta.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. pages 177–180, Prague, Czech Republic, June.

Hai Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous Space Translation Models with Neural Networks. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39–48, Montreal, Canada, June.

José B. Mariño, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. N-gram-based Machine Translation. Computational Linguistics, 32(4):527–549, December.

R. C. Moore and W. Lewis. 2010. Intelligent Selection of Language Model Training Data. In ACL (Short Papers), pages 220–224, Uppsala, Sweden, July.

Jan Niehues, Teresa Herrmann, Stephan Vogel, and Alex Waibel. 2011. Wider Context by Using Bilingual Language Models in Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 198–206.

Franz J. Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March.

Franz J. Och, Christoph Tillmann, and Hermann Ney. 1999. Improved Alignment Models for Statistical Machine Translation. In Proc. Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20–28, University of Maryland, College Park, MD, June.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY, September.

Darlene Stewart, Roland Kuhn, Eric Joanis, and George Foster. 2014. Coarse split and lump bilingual language models for richer source information in SMT. In AMTA, Vancouver, BC, Canada, October.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014. Translation Modeling with Bidirectional Recurrent Neural Networks. In Conference on Empirical Methods on Natural Language Processing, pages 14–25, Doha, Qatar, October.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT-NAACL 2004: Short Papers, HLT-NAACL-Short '04, pages 101–104, Stroudsburg, PA, USA.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving statistical machine translation with word class models. In Conference on Empirical Methods in Natural Language Processing, pages 1377–1381, Seattle, USA, October.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-Based Statistical Machine Translation. In 25th German Conf. on Artificial Intelligence (KI2002), pages 18–32, Aachen, Germany, September.

Hui Zhang, Kristina Toutanova, Chris Quirk, and Jianfeng Gao. 2013. Beyond left-to-right: Multiple decomposition structures for SMT. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 12–21, Atlanta, Georgia, June.
[…] models on the IWSLT 2013 German→English and BOLT Chinese→English task.

1.1 Motivation

Attention-based NMT computes the translation probability depending on an intermediate computation of an alignment distribution. The alignment distribution is used to choose the positions in the source sentence that the decoder attends to during translation. Therefore, the alignment model can be considered as an implicit part of the translation model. On the other hand, separating the alignment model from the lexical model has its own advantages: First, this leads to more flexibility in modeling and training: not only can the models be trained separately, but they can also have different model types, e.g. neural models, count-based models, etc. Second, the separation avoids propagating errors from one model to the other. In attention-based systems, the translation score is based on the alignment distribution, which risks propagating errors from the alignment part to the translation part. Third, using separate models makes it possible to assign them different weights. We exploit this and use a log-linear framework to combine them. We still retain the possibility of joint training, which can be performed flexibly by alternating between model training and alignment generation. The latter can be performed using forced decoding.

In contrast to the count-based models used in HMMs, we use neural models, which allow covering long context without having to explicitly address the smoothing problem that arises in count-based models.

2 Related Work

Most recently, NNs have been trained on large amounts of data, and applied to translate independent of the phrase-based framework. Sutskever et al. (2014) introduced the pure encoder-decoder approach, which avoids the concept of word alignments. Bahdanau et al. (2015) introduced an attention mechanism to the encoder-decoder approach, allowing the decoder to attend to certain source words. This method was refined in (Luong et al., 2015) to allow for local attention, which makes the decoder attend to representations of source words residing within a window. These translation models have shown competitive results, outperforming phrase-based systems when using ensembles on tasks like IWSLT English→German 2015 (Luong and Manning, 2015).

In this work, we follow the same standalone neural translation approach. However, we have a different treatment of alignments. While the attention-based soft-alignment model computes an alignment distribution as an intermediate step within the neural model, we follow the hard alignment concept used in phrase extraction. We separate the alignment model from the lexical model, and train them independently. At translation time, the decoder hypothesizes and scores the alignment path in addition to the translation.

Cohn et al. (2016) introduce several modifications to the attention-based model inspired by traditional word alignment concepts. They modify the network architecture, adding a first-order dependence by making the attention vector computed for a target position directly dependent on that of the previous position. Our alignment model has a first-order dependence that takes place at the input and output of the model, rather than an architectural modification of the neural network.

Yang et al. (2013) use NN-based lexical and alignment models, but they give up the probabilistic interpretation and produce unnormalized scores instead. Furthermore, they model alignments using a simple distortion model that has no dependence on lexical context. The models are used to produce new alignments which are in turn used to train phrase systems. This leads to no significant difference in terms of translation performance. Tamura et al. (2014) propose a lexicalized RNN alignment model. The model still produces non-probabilistic scores, and is used to generate word alignments used to train phrase-based systems. In this work, we develop a feed-forward neural alignment model that computes probabilistic scores, and use it directly in standalone decoding, without constraining it to the phrase-based framework. In addition, we use the neural models to produce alignments that are used to re-train the same neural models.

Schwenk (2012) proposed a feed-forward network that computes phrase scores offline, and the scores were added to the phrase table of a phrase-based system. Offline phrase scoring was also done in (Alkhouli et al., 2014) using semantic phrase features obtained using simple neural networks. In comparison, our work does not rely on the phrase-based system, rather, the neural networks […]
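Coming back to the log-linear combination of separately trained models mentioned in the motivation above, the idea can be illustrated with a small scoring sketch; the component models, their probabilities and the weights below are placeholders (in the actual system the weights are tuned with MERT).

    import math

    def loglinear_score(models, weights, hypothesis):
        # Weighted sum of log-probabilities; each model scores the full
        # hypothesis (translation plus its hypothesized alignment path).
        return sum(w * math.log(m(hypothesis)) for m, w in zip(models, weights))

    # Placeholder component models: a lexical model and an alignment (jump) model.
    lexical_model = lambda hyp: 0.04      # p(target words | source, alignment)
    alignment_model = lambda hyp: 0.20    # p(jumps | source, target history)
    word_penalty = lambda hyp: math.exp(-len(hyp["target"]))

    hyp = {"target": ["we", "come", "back"], "alignment": [1, 2, 8]}
    score = loglinear_score(
        [lexical_model, alignment_model, word_penalty],
        [0.7, 0.3, 0.1],                  # illustrative weights
        hyp,
    )
    print(score)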
Computing normalized probabilities is done using the softmax function, which requires computing the full output layer first, and then computing the normalization factor by summing over the output scores of the full vocabulary. This is very costly for large vocabularies. To overcome this, we adopt the class-factored output layer consisting of a class layer and a word layer (Goodman, 2001; Morin and Bengio, 2005). The model in this case is defined as

    p(e_i | e_{i−n}^{i−1}, f_{b̂_i−m}^{b̂_i+m}) = p(e_i | c(e_i), e_{i−n}^{i−1}, f_{b̂_i−m}^{b̂_i+m}) · p(c(e_i) | e_{i−n}^{i−1}, f_{b̂_i−m}^{b̂_i+m})

where c denotes a word mapping that assigns each target word to a single class, where the number of classes is chosen to be much smaller than the vocabulary size, |C| << |V|. Even though the full class layer needs to be computed, only a subset of the significantly-larger word layer has to be considered, namely the words that share the same class c(e_i) with the target word e_i. This helps speeding up training on large-vocabulary tasks.

4.2 Bidirectional Joint Model

The bidirectional RNN joint model (BJM) presented in (Sundermeyer et al., 2014a) is another lexical model. The BJM uses the full source sentence and the full target history for prediction, and it is computed by reordering the source sentence following the target order. This requires the complete alignment information to compute the model scores. Here, we introduce a variant of the model that is conditioned on the alignment history instead of the full alignment path. This is achieved by computing forward and backward representations of the source sentence in its original order, as done in (Bahdanau et al., 2015). The model is given by

    p(e_i | b_1^i, e_1^{i−1}, f_1^J) = p(e_i | b̂_1^i, e_1^{i−1}, f_1^J)

Note that we also use the same alignment heuristics presented in Section 4.1. As this variant does not require future alignment information, it can be applied in decoding. However, in this work we apply this model in rescoring and leave decoder integration to future work.

4.3 Feed-forward Alignment Model

We propose a neural alignment model to score alignment paths. Instead of predicting the absolute positions in the source sentence, we model the jumps from one source position to the next position to be translated. The jump at target position i is defined as Δ_i = b̂_i − b̂_{i−1}, which captures the jump from the source position b̂_{i−1} to b̂_i. We modify the FFNN lexical model to obtain a feed-forward alignment model. The feed-forward alignment model (FFAM) is given by

    p(b_i | b_1^{i−1}, e_1^{i−1}, f_1^J) = p(Δ_i | e_{i−n}^{i−1}, f_{b̂_{i−1}−m}^{b̂_{i−1}+m})    (2)

This is a lexicalized alignment model conditioned on the n-gram target history and the (2m + 1)-gram source window. Note that, different from the FFJM, the source window of this model is centered around the source position b̂_{i−1}. This is because the model needs to predict the jump to the next source position b̂_i to be translated. The alignment model architecture is shown in Figure 2.

In contrast to the lexical model, the output vocabulary of the alignment model is much smaller, and therefore we use a regular softmax output layer for this model without class-factorization.

4.4 Feed-forward vs. Recurrent Models

RNNs have been shown to outperform feed-forward variants in language and translation modeling. Nevertheless, feed-forward networks have their own advantages: First, they are typically […]
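Returning to the class-factored output layer described above, a small numpy sketch makes the factorization concrete: the class distribution is computed over all classes, while the word softmax is normalized only over the words sharing the class of the predicted word. The sizes, random parameters and class assignment are illustrative assumptions, not the actual model configuration.

    import numpy as np

    rng = np.random.default_rng(2)
    V, C, H = 10000, 100, 64                      # vocabulary, classes, hidden size
    word2class = rng.integers(0, C, size=V)       # c(e): each word mapped to one class

    W_class = 0.01 * rng.standard_normal((H, C))
    W_word = 0.01 * rng.standard_normal((H, V))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def factored_prob(h, word):
        # p(e | context) = p(e | c(e), context) * p(c(e) | context).
        c = word2class[word]
        p_class = softmax(h @ W_class)[c]
        members = np.flatnonzero(word2class == c)  # words sharing class c
        scores = h @ W_word[:, members]            # only this subset is normalized
        p_word_given_class = softmax(scores)[np.where(members == word)[0][0]]
        return p_word_given_class * p_class

    h = rng.standard_normal(H)                     # hidden representation of the context
    print(factored_prob(h, word=42))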
Table 3: BOLT Chinese→English results in Bleu [%] and Ter [%].

    #   system                     test1 Bleu   test1 Ter   test2 Bleu   test2 Ter
    3   attention-based RNN        14.8         76.1        13.6         76.9
    4    +fine-tuning              16.1         73.1        15.4         72.3
    5   FFJM+dp+wp                 10.1         77.2        9.8          75.8
    6   FFJM+FFAM+wp               14.4         71.9        13.7         71.3
    7    +fine-tuning              15.8         70.3        15.4         69.4
    8    +BJM Rescoring            16.0         70.3        15.8         69.5
    9   BJM+FFAM+wp+fine-tuning    16.0         70.4        15.7         69.7

[Figure: speed-up factor and Bleu [%] plotted against the number of word classes.]

Table 4: Re-alignment results in Bleu [%] and Ter [%] on the IWSLT 2013 German→English in-domain data. Each system includes FFJM, FFAM and word penalty.

[…] trained using GIZA++ alignments. We train new models using the re-aligned data and compare the translation quality before and after re-alignment. We use 0.7 and 0.3 as model weights for the FFJM
Table 5: Sample translations from the IWSLT German→English test set using the attention-based system (Table 2, row #4) and our system (Table 2, row #7). We highlight the (pre-ordered) source words and their aligned target words. We underline the source words of interest, italicize correct translations, and use bold-face for incorrect translations.

[…] as alignment. While we use larger vocabularies compared to the attention-based system, we observe incorrect translations of rare words. E.g., the German word Ölknappheit in sentence 3 occurs only 7 times in the training data among 108M words, and therefore it is an unknown word for the attention system. Our system has the word in the source vocabulary but fails to predict the right translation. Another problem occurs in sentence 4, where the German verb "zurückgehen" is split into "gehen ... zurück". Since the feed-forward model uses a source window of size 9, it cannot include both words when it is centered at any of them. Such insufficient context might be resolved when integrating the bidirectional RNN in decoding. Note that the attention-based model also fails to produce the correct translation here.

9 Conclusion

This work takes a step towards bridging the gap between conventional word alignment concepts and NMT. We use an HMM-inspired factorization of the lexical and alignment models, and employ the Viterbi alignments obtained using conventional HMM/IBM models to train neural models. An alignment-based decoder is introduced and a log-linear framework is used to combine the models. We use MERT to tune the model weights. Our system outperforms the attention-based system on the German→English task by up to 0.9% Bleu, and on Chinese→English by up to 2.8% Ter. We also demonstrate that re-aligning the training data using the neural decoder yields better translation quality.

As future work, we aim to integrate alignment-based RNNs such as the BJM into the alignment-based decoder. We also plan to develop a bidirectional RNN alignment model to make jump decisions based on unbounded context. In addition, we want to investigate the use of coverage constraints in alignment-based NMT. Furthermore, we consider the re-alignment experiment promising and plan to apply re-alignment on the full bilingual data of each task.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21). Tamer Alkhouli was partly funded by the 2016 Google PhD Fellowship for North America, Europe and the Middle East.

References

Tamer Alkhouli, Andreas Guta, and Hermann Ney. 2014. Vector space models for phrase-based machine translation. In EMNLP 2014 Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 1–10, Doha, Qatar, October.

Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. Investigations on phrase-based decoding with recurrent neural network language and translation models.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, San Diego, California, USA, May.

Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. December.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference, Austin, TX, USA, June.

Stanley F. Chen and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, Cambridge, MA, August.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar, October.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In 49th Annual Meeting of the Association for Computational Linguistics, pages 176–181, Portland, Oregon, June.

Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 876–885, San Diego, California, June.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and Robust Neural Network Joint Models for Statistical Machine Translation. In 52nd Annual Meeting of the Association for Computational Linguistics, pages 1370–1380, Baltimore, MD, USA, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 848–856, Honolulu, Hawaii, USA, October.

Joshua Goodman. 2001. Classes for fast maximum entropy training. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 561–564, Utah, USA, May.

Andreas Guta, Tamer Alkhouli, Jan-Thorsten Peter, Joern Wuebker, and Hermann Ney. 2015. A Comparison between Count and Neural Network Models Based on Joint Translation and Reordering Sequences. In Conference on Empirical Methods on Natural Language Processing, pages 1401–1411, Lisbon, Portugal, September.

Yuening Hu, Michael Auli, Qin Gao, and Jianfeng Gao. 2014. Minimum translation modeling with recurrent neural networks. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 20–29, Gothenburg, Sweden, April.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1–10, Beijing, China, July.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for M-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184, May.

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of the 2003 Meeting of the North American Chapter of the Association for Computational Linguistics, pages 127–133, Edmonton, Canada, May/June.

Hai Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous Space Translation Models with Neural Networks. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39–48, Montreal, Canada, June.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, pages 76–79, Da Nang, Vietnam, December.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September.
R.C. Moore and W. Lewis. 2010. Intelligent Selection of Language Model Training Data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 220–224, Uppsala, Sweden, July.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, pages 246–252, Barbados, January.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July.

Stephan Peitz, Arne Mauser, Joern Wuebker, and Hermann Ney. 2012. Forced derivations for hierarchical machine translation. In International Conference on Computational Linguistics, pages 933–942, Mumbai, India, December.

Maja Popović and Hermann Ney. 2006. POS-based word reorderings for statistical machine translation. In Language Resources and Evaluation, pages 1278–1283, Genoa, Italy, May.

Holger Schwenk. 2012. Continuous Space Translation Models for Phrase-Based Statistical Machine Translation. In 25th International Conference on Computational Linguistics, pages 1071–1080, Mumbai, India, December.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, August.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, Massachusetts, USA, August.

Andreas Stolcke. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Speech and Language Processing, volume 2, pages 901–904, Denver, CO, September.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014a. Translation Modeling with Bidirectional Recurrent Neural Networks. In Conference on Empirical Methods on Natural Language Processing, pages 14–25, Doha, Qatar, October.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2014b. rwthlm – the RWTH Aachen University neural network language modeling toolkit. In Interspeech, pages 2093–2097, Singapore, September.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112, Montréal, Canada, December.

Akihiro Tamura, Taro Watanabe, and Eiichiro Sumita. 2014. Recurrent neural networks for word alignment model. In 52nd Annual Meeting of the Association for Computational Linguistics, pages 1470–1480, Baltimore, MD, USA.

Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. 2015. Blocks and Fuel: Frameworks for deep learning. CoRR, abs/1506.00619.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, Washington, USA, October.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open source hierarchical translation, extended with reordering and lexicon models. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 262–270, Uppsala, Sweden, July.

Joern Wuebker, Arne Mauser, and Hermann Ney. 2010. Training phrase translation models with leaving-one-out. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 475–484, Uppsala, Sweden, July.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open source phrase-based and hierarchical statistical machine translation. In International Conference on Computational Linguistics, pages 483–491, Mumbai, India, December.

Nan Yang, Shujie Liu, Mu Li, Ming Zhou, and Nenghai Yu. 2013. Word alignment modeling with context dependent deep neural network. In 51st Annual Meeting of the Association for Computational Linguistics, pages 166–175, Sofia, Bulgaria, August.
… implemented in Jane, RWTH’s open source statistical machine translation toolkit (Freitag et al., 2014a). The Jane system combination is a mature implementation which previously has been successfully employed in other collaborative projects and for different language pairs (Freitag et al., 2013; Freitag et al., 2014b; Freitag et al., 2014c).

In the remainder of the paper, we present the technical details of the QT21/HimL combined machine translation system and the experimental results obtained with it. The paper is structured as follows: We describe the common preprocessing used for most of the individual engines in Section 2. Section 3 covers the characteristics of the different individual engines, followed by a brief overview of our system combination approach (Section 4). We then summarize our empirical results in Section 5, showing that we achieve better translation quality than with any individual engine. Finally, in Section 6, we provide a statistical analysis of certain linguistic phenomena, specifically the prediction precision on morphological attributes. We conclude the paper with Section 7.

… table is adapted to the SETimes2 corpus (Niehues and Waibel, 2012). The system uses a pre-reordering technique (Rottmann and Vogel, 2007) in combination with lexical reordering. It uses two word-based n-gram language models and three additional non-word language models. Two of them are automatic word class-based (Och, 1999) language models, using 100 and 1,000 word classes. In addition, we use a POS-based language model. During decoding, we use a discriminative word lexicon (Niehues and Waibel, 2013) as well.

We rescore the system output using a 300-best list. The weights are optimized on the concatenation of the development data and the SETimes2 dev set using the ListNet algorithm (Niehues et al., 2015). In rescoring, we add the source discriminative word lexica (Herrmann et al., 2015) as well as neural network language and translation models. These models use a factored word representation of the source and the target. On the source side we use the word surface form and two automatic word classes using 100 and 1,000 classes. On the Romanian side, we add the POS information as an additional word factor.
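To make the factored word representation concrete, the toy sketch below encodes factors as pipe-separated fields per token, one common convention in factored MT setups; the field layout and the helper names are illustrative assumptions, not the exact format used by the KIT system.

# Illustrative only: pipe-separated word factors, as commonly used in
# factored MT setups. The field layout is an assumption made for this
# sketch, not the exact format of the system described above.
def factorize_source(word, class100, class1000):
    """English source token: surface form plus two automatic word classes."""
    return f"{word}|{class100}|{class1000}"

def factorize_target(word, pos):
    """Romanian target token: surface form plus POS tag."""
    return f"{word}|{pos}"

src = [factorize_source("house", 17, 834), factorize_source("prices", 42, 291)]
tgt = [factorize_target("preturile", "NOUN"), factorize_target("caselor", "NOUN")]
print(" ".join(src))   # house|17|834 prices|42|291
print(" ".join(tgt))   # preturile|NOUN caselor|NOUN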
… more specifically its implementation in XenC (Rousseau, 2013). As a result, one third of the initial corpus is removed. Finally, we make a linear interpolation of these models, using the SRILM toolkit (Stolcke, 2002).

3.3 LMU-CUNI

The LMU-CUNI contribution is a constrained Moses phrase-based system. It uses a simple factored setting: our phrase table produces not only the target surface form but also its lemma and morphological tag. On the input, we include lemmas, POS tags and information from dependency parses (lemma of the parent node and syntactic relation), all encoded as additional factors.

The main difference from a standard phrase-based setup is the addition of a feature-rich discriminative translation model which is conditioned on both source- and target-side context (Tamchyna et al., 2016). The motivation for using this model is to better condition lexical choices by using the source context and to improve morphological and topical coherence by modeling the (limited left-hand side) target context.

We also take advantage of the target factors by using a 7-gram language model trained on sequences of Romanian morphological tags. Finally, our system also uses a standard lexicalized reordering model.

3.4 LMU

The LMU system integrates a discriminative rule selection model into a hierarchical SMT system, as described in (Tamchyna et al., 2014). The rule selection model is implemented using the high-speed classifier Vowpal Wabbit2, which is fully integrated in Moses’ hierarchical decoder. During decoding, the rule selection model is called at each rule application with syntactic context information as feature templates. The features are the same as used by Braune et al. (2015) in their string-to-tree system, including both lexical and soft source syntax features. The translation model features comprise the standard hierarchical features (Chiang, 2005) with an additional feature for the rule selection model (Braune et al., 2016).

2 http://hunch.net/~vw/ (VW). Implemented by John Langford and many others.

Before training, we reduce the number of translation rules using significance testing (Johnson et al., 2007). To extract the features of the rule selection model, we parse the English part of our training data using the Berkeley parser (Petrov et al., 2006). For model prediction during tuning and decoding, we use parsed versions of the development and test sets. We train the rule selection model using VW and tune the weights of the translation model using batch MIRA (Cherry and Foster, 2012). The 5-gram language model is trained using KenLM (Heafield et al., 2013) on the Romanian part of the Common Crawl corpus concatenated with the Romanian part of the training data.

3.5 RWTH Aachen University: Hierarchical Phrase-based System

The RWTH hierarchical setup uses the open source translation toolkit Jane 2.3 (Vilar et al., 2010). Hierarchical phrase-based translation (HPBT) (Chiang, 2007) induces a weighted synchronous context-free grammar from parallel text. In addition to the contiguous lexical phrases, as used in phrase-based translation (PBT), hierarchical phrases with up to two gaps are also extracted. Our baseline model contains models with phrase translation probabilities and lexical smoothing probabilities in both translation directions, word and phrase penalty, and enhanced low frequency features (Chen et al., 2011). It also contains binary features to distinguish between hierarchical and non-hierarchical phrases, the glue rule, and rules with non-terminals at the boundaries. We use the cube pruning algorithm (Huang and Chiang, 2007) for decoding.

The system uses three backoff language models (LM) that are estimated with the KenLM toolkit (Heafield et al., 2013) and are integrated into the decoder as separate models in the log-linear combination: a full 4-gram LM (trained on all data), a limited 5-gram LM (trained only on in-domain data), and a 7-gram word class language model (wcLM) (Wuebker et al., 2013) trained on all data and with an output vocabulary of 143K words.

The system produces 1000-best lists which are reranked using an LSTM-based (Hochreiter and Schmidhuber, 1997; Gers et al., 2000; Gers et al., 2003) language model (Sundermeyer et al., 2012) and an LSTM-based bidirectional joined model (BJM) (Sundermeyer et al., 2014a). The models have a class-factored output layer (Goodman, 2001; Morin and Bengio, 2005) to speed up training and evaluation. The language model uses 3 stacked LSTM layers, with 350 nodes each. The BJM has a projection layer, and computes a
forward recurrent state encoding the source and target history, a backward recurrent state encoding the source future, and a third LSTM layer to combine them. All layers have 350 nodes. The neural networks are implemented using an extension of the RWTHLM toolkit (Sundermeyer et al., 2014b). The parameter weights are optimized with MERT (Och, 2003) towards the BLEU metric.

3.6 RWTH Neural System

The second system provided by the RWTH is an attention-based recurrent neural network similar to (Bahdanau et al., 2015). The implementation is based on Blocks (van Merriënboer et al., 2015) and Theano (Bergstra et al., 2010; Bastien et al., 2012).

The network uses the 30K most frequent words on the source and target side as input vocabulary. The decoder and encoder word embeddings are of size 620. The encoder uses a bidirectional layer with 1024 GRUs (Cho et al., 2014) to encode the source side, while the decoder uses a 1024-GRU layer.

The network is trained for up to 300K updates with a minibatch size of 80 using Adadelta (Zeiler, 2012). The network is evaluated every 10000 updates on BLEU and the best network on the newsdev2016/1 dev set is selected as the final network.

The monolingual News Crawl 2015 corpus is translated into English with a simple phrase-based translation system to create additional parallel training data. The new data is weighted by using the News Crawl 2015 corpus (2.3M sentences) once, the Europarl corpus (0.4M sentences) twice and the SETimes2 corpus (0.2M sentences) three times. The final system is an ensemble of 4 networks, all with the same configuration and training settings.

3.7 Tilde

The Tilde system is a phrase-based machine translation system built on the LetsMT infrastructure (Vasiļjevs et al., 2012) that features language-specific data filtering and cleaning modules. Tilde’s system was trained on all available parallel data. Two language models are trained using KenLM (Heafield, 2011): 1) a 5-gram model using the Europarl and SETimes2 corpora, and 2) a 3-gram model using the Common Crawl corpus. We also apply a custom tokenization tool that takes into account specifics of the Romanian language and handles non-translatable entities (e.g., file paths, URLs, e-mail addresses, etc.). During translation a rule-based localisation feature is applied.

3.8 Edinburgh/LMU Hierarchical System

The UEDIN-LMU HPBT system is a hierarchical phrase-based machine translation system (Chiang, 2005) built jointly by the University of Edinburgh and LMU Munich. The system is based on the open source Moses implementation of the hierarchical phrase-based paradigm (Hoang et al., 2009). In addition to a set of standard features in a log-linear combination, a number of non-standard enhancements are employed to achieve improved translation quality.

Specifically, we integrate individual language models trained over the separate corpora (News Crawl 2015, Europarl, SETimes2) directly into the log-linear combination of the system and let MIRA (Cherry and Foster, 2012) optimize their weights along with all other features in tuning, rather than relying on a single linearly interpolated language model. We add another background language model estimated over a concatenation of all Romanian corpora including Common Crawl. All language models are unpruned.

For hierarchical rule extraction, we impose less strict extraction constraints than the Moses defaults. We extract more hierarchical rules by allowing for a maximum of ten symbols on the source side, a maximum span of twenty words, and no lower limit to the amount of words covered by right-hand side non-terminals at extraction time. We discard rules with non-terminals on their right-hand side if they are singletons in the training data.

In order to promote better reordering decisions, we implemented a feature in Moses that resembles the phrase orientation model for hierarchical machine translation as described by Huck et al. (2013) and extend our system with it. The model scores orientation classes (monotone, swap, discontinuous) for each rule application in decoding.

We finally follow the approach outlined by Huck et al. (2011) for lightly-supervised training of hierarchical systems. We automatically translate parts (1.2M sentences) of the monolingual Romanian News Crawl 2015 corpus to English with a Romanian→English phrase-based statistical machine translation system (Williams et al., 2016). The foreground phrase table extracted from the human-generated parallel data is filled
up with entries from a background phrase table extracted from the automatically produced News Crawl 2015 parallel data.

Huck et al. (2016) give a more in-depth description of the Edinburgh/LMU hierarchical machine translation system, along with detailed experimental results.

3.9 Edinburgh Neural System

Edinburgh’s neural machine translation system is an attentional encoder-decoder (Bahdanau et al., 2015), which we train with nematus.3 We use byte-pair-encoding (BPE) to achieve open-vocabulary translation with a fixed vocabulary of subword symbols (Sennrich et al., 2016c). We produce additional parallel training data by automatically translating the monolingual Romanian News Crawl 2015 corpus into English (Sennrich et al., 2016b), which we combine with the original parallel data in a 1-to-1 ratio. We use minibatches of size 80, a maximum sentence length of 50, word embeddings of size 500, and hidden layers of size 1024. We apply dropout to all layers (Gal, 2015), with dropout probability 0.2, and also drop out full words with probability 0.1. We clip the gradient norm to 1.0 (Pascanu et al., 2013). We train the models with Adadelta (Zeiler, 2012), reshuffling the training corpus between epochs. We validate the model every 10 000 minibatches via BLEU on a validation set, and perform early stopping on BLEU. Decoding is performed with beam search with a beam size of 12.

3 https://github.com/rsennrich/nematus

A more detailed description of the system, and more experimental results, can be found in (Sennrich et al., 2016a).

3.10 Edinburgh Phrase-based System

Edinburgh’s phrase-based system is built using the Moses toolkit, with fast align (Dyer et al., 2013) for word alignment, and KenLM (Heafield et al., 2013) for language model training. In our Moses setup, we use hierarchical lexicalized reordering (Galley and Manning, 2008), operation sequence model (Durrani et al., 2013), domain indicator features, and binned phrase count features. We use all available parallel data for the translation model, and all available Romanian text for the language model. We use two different 5-gram language models; one built from all the monolingual target text concatenated, without pruning, and one built from only News Crawl 2015, with singleton 3-grams and above pruned out. The weights of all these features and models are tuned with k-best MIRA (Cherry and Foster, 2012) on the first half of newsdev2016. In decoding, we use MBR (Kumar and Byrne, 2004), cube-pruning (Huang and Chiang, 2007) with a pop-limit of 5000, and the Moses ”monotone at punctuation” switch (to prevent reordering across punctuation) (Koehn and Haddow, 2009).

3.11 USFD Phrase-based System

USFD’s phrase-based system is built using the Moses toolkit, with MGIZA (Gao and Vogel, 2008) for word alignment and KenLM (Heafield et al., 2013) for language model training. We use all available parallel data for the translation model. A single 5-gram language model is built using all the target side of the parallel data and a subpart of the monolingual Romanian corpora selected with Xenc-v2 (Rousseau, 2013). For the latter we use all the parallel data as in-domain data and the first half of newsdev2016 as development set. The feature weights are tuned with MERT (Och, 2003) on the first half of newsdev2016.

The system produces distinct 1000-best lists, for which we extend the feature set with the 17 baseline black-box features from sentence-level Quality Estimation (QE) produced with Quest++4 (Specia et al., 2015). The 1000-best lists are then reranked and the top-best hypothesis extracted using the nbest rescorer available within the Moses toolkit.

4 http://www.quest.dcs.shef.ac.uk/quest_files/features_blackbox_baseline_17

3.12 UvA

We use a phrase-based machine translation system (Moses) with a distortion limit of 6 and lexicalized reordering. Before translation, the English source side is preordered using the neural preordering model of (de Gispert et al., 2015). The preordering model is trained for 30 iterations on the full MGIZA-aligned training data. We use two language models, built using KenLM. The first is a 5-gram language model trained on all available data. Words in the Common Crawl dataset that appear fewer than 500 times were replaced by UNK, and all singleton ngrams of order 3 or higher were pruned. We also use a 7-gram class-based language model, trained on the same data. 512 word …
Table 1: Results of the individual systems for the English→Romanian task. BLEU [%] and TER [%] scores are case-sensitive.

Table 2: Comparison of system outputs against each other, generated by computing BLEU and TER on the system translations for newstest2016. One system in a pair is used as the reference, the other as candidate translation; we report the average over both directions. The upper-right half lists BLEU [%] scores, the lower-left half TER [%] scores.

Columns, left to right (same order as the rows): KIT, LIMSI, LMU-CUNI, LMU, RWTH HPBT, RWTH NMT, Tilde, UEDIN-LMU HPBT, UEDIN PBT, UEDIN NMT, USFD, UvA, Average.

KIT              -     55.0  55.9  51.7  56.2  48.2  50.3  54.6  55.1  42.8  56.6  54.1  52.8
LIMSI            29.3  -     54.3  52.1  51.8  43.0  49.8  55.3  56.2  38.2  57.3  52.1  51.4
LMU-CUNI         28.5  30.8  -     52.4  53.3  43.8  55.4  56.0  56.6  39.3  58.6  56.6  52.9
LMU              31.2  32.0  31.7  -     53.6  43.1  49.1  59.4  58.6  37.8  56.1  55.8  51.8
RWTH HPBT        28.5  32.4  31.2  30.8  -     47.5  50.1  54.9  55.6  41.8  53.9  55.3  52.2
RWTH NMT         33.7  37.9  37.3  37.5  34.8  -     40.8  44.3  45.3  46.0  43.8  43.6  44.5
Tilde            32.2  33.7  29.6  33.8  33.4  39.6  -     53.4  58.5  36.5  55.5  52.0  50.1
UEDIN-LMU HPBT   29.5  29.9  29.4  27.3  29.8  36.9  30.9  -     62.8  38.9  59.6  56.2  54.1
UEDIN PBT        28.4  28.9  28.5  27.0  29.3  35.4  27.0  24.2  -     39.4  60.2  58.6  55.2
UEDIN NMT        38.6  42.6  42.0  43.0  40.1  35.5  44.0  42.1  41.1  -     38.2  38.2  39.7
USFD             27.6  28.8  27.4  28.8  30.4  37.0  29.1  26.5  25.7  42.6  -     58.8  54.4
UvA              29.9  32.0  28.6  29.2  29.6  37.5  31.5  29.0  26.5  43.2  26.9  -     52.9
Average          30.7  32.6  31.4  32.0  31.8  36.6  33.2  30.5  29.3  41.3  30.0  31.3  -
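Once the system outputs are available as plain-text files, the pairwise scores in Table 2 can be recomputed along the following lines. The sketch uses the sacrebleu library and placeholder file names purely for illustration; the toolkit actually used to produce the table is not specified in this report.

# Illustrative sketch: pairwise system-vs-system BLEU, treating one output as
# the reference and the other as the candidate, averaged over both directions.
# File names and the choice of sacrebleu are assumptions, not from the report.
import itertools
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

systems = {"KIT": "kit.ro", "LIMSI": "limsi.ro"}  # placeholder paths
outputs = {name: read_lines(path) for name, path in systems.items()}

for a, b in itertools.combinations(systems, 2):
    bleu_ab = sacrebleu.corpus_bleu(outputs[a], [outputs[b]]).score
    bleu_ba = sacrebleu.corpus_bleu(outputs[b], [outputs[a]]).score
    print(f"{a} vs {b}: {0.5 * (bleu_ab + bleu_ba):.1f}")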
Table 3: Precision of each system on morphological attribute prediction computed over the reference translation using METEOR alignments. The last row shows the ratio of reference words for which METEOR managed to find an alignment in the hypothesis. One column is reported per system: the twelve individual engines (KIT, LIMSI, LMU-CUNI, LMU, RWTH HPBT, RWTH NMT, Tilde, UEDIN-LMU HPBT, UEDIN PBT, UEDIN NMT, USFD, UvA) plus the Combination.

Attribute
Case        46.7% 46.0% 46.3% 45.7% 47.7% 48.0% 44.4% 46.3% 47.4% 49.8% 45.4% 45.4% 50.8%
Definite    50.5% 49.1% 50.0% 49.2% 50.5% 50.1% 47.2% 50.0% 50.5% 51.0% 49.2% 48.9% 53.3%
Gender      51.9% 51.0% 51.9% 51.3% 52.6% 52.1% 49.6% 51.9% 52.7% 53.0% 51.2% 50.9% 54.9%
Number      53.2% 51.7% 52.6% 52.3% 53.6% 53.7% 50.6% 52.9% 53.6% 54.9% 52.1% 51.8% 56.3%
Person      52.8% 51.3% 52.0% 52.0% 53.5% 55.0% 50.6% 52.6% 53.4% 57.2% 52.4% 51.6% 57.1%
Tense       45.8% 44.1% 44.7% 44.8% 45.7% 45.5% 42.3% 45.2% 45.1% 46.6% 44.9% 44.8% 48.0%
Verb form   45.9% 44.4% 45.5% 44.9% 46.6% 47.0% 43.9% 46.1% 46.5% 47.2% 45.5% 43.3% 48.7%
Reference words with alignment
            57.7% 56.7% 57.3% 57.3% 58.3% 57.6% 55.7% 58.0% 58.5% 58.3% 57.3% 56.8% 60.4%
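As a reading aid, the toy sketch below shows one plausible way such attribute precisions can be computed from word alignments between a hypothesis and the reference; the data structures, attribute names and the exact matching rule are assumptions made for illustration, not the evaluation code used for Table 3.

# Toy sketch (assumptions, not the actual evaluation script): given METEOR-style
# word alignments between hypothesis and reference, count how often the
# hypothesis word carries the same value of a morphological attribute as the
# aligned reference word.
def attribute_precision(alignments, hyp_tags, ref_tags, attribute):
    """alignments: list of (hyp_index, ref_index) pairs.
    hyp_tags / ref_tags: per-token dicts of morphological attributes."""
    matched, total = 0, 0
    for h, r in alignments:
        ref_value = ref_tags[r].get(attribute)
        if ref_value is None:      # reference word does not carry this attribute
            continue
        total += 1
        if hyp_tags[h].get(attribute) == ref_value:
            matched += 1
    return matched / total if total else 0.0

# Tiny example with made-up tags.
hyp_tags = [{"Number": "Sing"}, {"Number": "Plur"}]
ref_tags = [{"Number": "Sing"}, {"Number": "Sing"}]
alignments = [(0, 0), (1, 1)]
print(attribute_precision(alignments, hyp_tags, ref_tags, "Number"))  # 0.5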
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.

Fabienne Braune, Nina Seemann, and Alexander Fraser. 2015. Rule Selection with Soft Syntactic Features for String-to-Tree Statistical Machine Translation. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP).

Fabienne Braune, Alexander Fraser, Hal Daumé III, and Aleš Tamchyna. 2016. A Framework for Discriminative Rule Selection in Hierarchical Moses. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Francesco Casacuberta and Enrique Vidal. 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics, 30(3):205–225.

Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. 2011. Unpacking and Transforming Feature Functions: New Ways to Smooth Phrase Tables. In MT Summit XIII, pages 269–275, Xiamen, China, September.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proc. of the Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 427–436, Montréal, Canada, June.

David Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 263–270, Ann Arbor, Michigan, June.

David Chiang. 2007. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Joseph M. Crego and José B. Mariño. 2006. Improving statistical MT by coupling reordering and decoding. Machine Translation, 20(3):199–215, July.

Josep Maria Crego, François Yvon, and José B. Mariño. 2011. N-code: an open-source Bilingual N-gram SMT Toolkit. Prague Bulletin of Mathematical Linguistics, 96:49–58.

Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne. 2015. Fast and accurate preordering for SMT using neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1012–1017, Denver, Colorado, May–June.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT? In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 399–405, Sofia, Bulgaria, August.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of NAACL-HLT, pages 644–648, Atlanta, Georgia, June.

Markus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Nadir Durrani, Matthias Huck, Philipp Koehn, Thanh-Le Ha, Jan Niehues, Mohammed Mediani, Teresa Herrmann, Alex Waibel, Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2013. EU-BRIDGE MT: Text Translation of Talks in the EU-BRIDGE Project. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 128–135, Heidelberg, Germany, December.

Markus Freitag, Matthias Huck, and Hermann Ney. 2014a. Jane: Open Source Machine Translation System Combination. In Proc. of the Conf. of the European Chapter of the Assoc. for Computational Linguistics (EACL), pages 29–32, Gothenburg, Sweden, April.

Markus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Matthias Huck, Rico Sennrich, Nadir Durrani, Maria Nadejde, Philip Williams, Philipp Koehn, Teresa Herrmann, Eunah Cho, and Alex Waibel. 2014b. EU-BRIDGE MT: Combined Machine Translation. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 105–113, Baltimore, MD, USA, June.

Markus Freitag, Joern Wuebker, Stephan Peitz, Hermann Ney, Matthias Huck, Alexandra Birch, Nadir Durrani, Philipp Koehn, Mohammed Mediani, Isabel Slawik, Jan Niehues, Eunah Cho, Alex Waibel, Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2014c. Combined Spoken Language Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 57–64, Lake Tahoe, CA, USA, December.

Yarin Gal. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. ArXiv e-prints.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 848–856, Stroudsburg, PA, USA. Association for Computational Linguistics.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471.

Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. 2003. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research, 3:115–143.

Joshua Goodman. 2001. Classes for fast maximum entropy training. CoRR, cs.CL/0108006.

Spence Green, Daniel Cer, and Christopher D. Manning. 2014. An empirical comparison of features and tuning for phrase-based machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. pages 690–696, Sofia, Bulgaria, August.

Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom, July.

Teresa Herrmann, Jan Niehues, and Alex Waibel. 2015. Source Discriminative Word Lexicon for Translation Disambiguation. In Proceedings of the 12th International Workshop on Spoken Language Translation (IWSLT15), Danang, Vietnam.

Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A Unified Framework for Phrase-Based, Hierarchical, and Syntax-Based Statistical Machine Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 152–159, Tokyo, Japan, December.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Liang Huang and David Chiang. 2007. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 144–151, Prague, Czech Republic, June.

Matthias Huck, David Vilar, Daniel Stein, and Hermann Ney. 2011. Lightly-Supervised Training for Hierarchical Phrase-Based Machine Translation. In Proc. of the EMNLP 2011 Workshop on Unsupervised Learning in NLP, pages 91–96, Edinburgh, Scotland, UK, July.

Matthias Huck, Joern Wuebker, Felix Rietig, and Hermann Ney. 2013. A Phrase Orientation Model for Hierarchical Machine Translation. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 452–463, Sofia, Bulgaria, August.

Matthias Huck, Alexander Fraser, and Barry Haddow. 2016. The Edinburgh/LMU Hierarchical Machine Translation System for WMT 2016. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving Translation Quality by Discarding Most of the Phrasetable. In Proc. of EMNLP-CoNLL 2007.

Philipp Koehn and Barry Haddow. 2009. Edinburgh’s Submission to all Tracks of the WMT 2009 Shared Task with Reordering and Speed Improvements to Moses. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 160–164, Athens, Greece.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. pages 177–180, Prague, Czech Republic, June.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-Risk Decoding for Statistical Machine Translation. In HLT 2004 – Human Language Technology Conference, Boston, MA, May.

José B. Mariño, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. N-gram-based machine translation. Computational Linguistics, 32(4):527–549, December.

R.C. Moore and W. Lewis. 2010. Intelligent Selection of Language Model Training Data. In ACL (Short Papers), pages 220–224, Uppsala, Sweden, July.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252. Society for Artificial Intelligence and Statistics.

J. Niehues and A. Waibel. 2012. Detailed Analysis of Different Strategies for Phrase Table Adaptation in SMT. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas, San Diego, CA, USA.
Jan Niehues and Alex Waibel. 2013. An MT Error-Driven Discriminative Word Lexicon using Sentence Structure Features. In Proceedings of the 8th Workshop on Statistical Machine Translation, Sofia, Bulgaria.

Jan Niehues, Quoc Khanh Do, Alexandre Allauzen, and Alex Waibel. 2015. ListNet-based MT Rescoring. EMNLP 2015, page 248.

Franz Josef Och. 1999. An Efficient Method for Determining Bilingual Word Classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan, July.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pages 1310–1318, Atlanta, GA, USA.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. In Proc. of the 21st Int. Conf. on Computational Linguistics and the 44th Annual Meeting of the Assoc. for Computational Linguistics, pages 433–440.

Kay Rottmann and Stephan Vogel. 2007. Word Reordering in Statistical Machine Translation with a POS-Based Distortion Model. In Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, Skövde, Sweden.

Anthony Rousseau. 2013. XenC: An open-source tool for data selection in natural language processing. The Prague Bulletin of Mathematical Linguistics, (100):73–82.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016c. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

Lucia Specia, Gustavo Paetzold, and Carolina Scarton. 2015. Multi-level translation quality prediction with QuEst++. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 115–120, Beijing, China, July. Association for Computational Linguistics and The Asian Federation of Natural Language Processing.

Andreas Stolcke. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proc. of the Int. Conf. on Speech and Language Processing (ICSLP), volume 2, pages 901–904, Denver, CO, September.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In Interspeech, Portland, OR, USA, September.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014a. Translation Modeling with Bidirectional Recurrent Neural Networks. In Conference on Empirical Methods in Natural Language Processing, pages 14–25, Doha, Qatar, October.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2014b. rwthlm – The RWTH Aachen University Neural Network Language Modeling Toolkit. In Interspeech, pages 2093–2097, Singapore, September.

Aleš Tamchyna, Fabienne Braune, Alexander M. Fraser, Marine Carpuat, Hal Daumé III, and Chris Quirk. 2014. Integrating a Discriminative Classifier into Phrase-based and Hierarchical Decoding. The Prague Bulletin of Mathematical Linguistics (PBML), 101:29–42.

Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, and Marcin Junczys-Dowmunt. 2016. Target-Side Context for Discriminative Models in Statistical Machine Translation. In Proc. of ACL, Berlin, Germany, August. Association for Computational Linguistics.

Dan Tufiş, Radu Ion, Ru Ceauşu, and Dan Ştefănescu. 2008. RACAI’s Linguistic Web Services. In Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), Marrakech, Morocco, May. European Language Resources Association (ELRA).

Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. 2015. Blocks and Fuel: Frameworks for deep learning. CoRR, abs/1506.00619.

Andrejs Vasiļjevs, Raivis Skadiņš, and Jörg Tiedemann. 2012. LetsMT!: A Cloud-Based Platform for Do-It-Yourself Machine Translation. In Min Zhang, editor, Proceedings of the ACL 2012 System Demonstrations, pages 43–48, Jeju Island, Korea, July. Association for Computational Linguistics.
We describe the submissions of ILLC UvA to the metrics and tuning tasks at WMT15. Both submissions are based on the BEER evaluation metric originally presented at WMT14 (Stanojević and Sima’an, 2014a). The main changes introduced this year are: (i) extending the learning-to-rank trained sentence level metric to the corpus level (but still decomposable to the sentence level), (ii) incorporating syntactic ingredients based on dependency trees, and (iii) a technique for finding parameters of BEER that avoid “gaming of the metric” during tuning.

1 Introduction

In the 2014 WMT metrics task, BEER turned up as the best sentence level evaluation metric on average over 10 language pairs (Machacek and Bojar, 2014). We believe that this was due to: […]

3. removing the recall bias from BEER.

In Section 2 we give a short introduction to BEER, after which we move to the innovations for this year in Sections 3, 4 and 5. We show the results from the metric and tuning tasks in Section 6, and conclude in Section 7.

2 BEER basics

The model underlying the BEER metric is flexible for the integration of an arbitrary number of new features and has a training method that is targeted at producing good rankings among systems. Two other characteristic properties of BEER are its hierarchical reordering component and its character n-gram lexical matching component.

2.1 Old BEER scoring

BEER is essentially a linear model with which the score can be computed in the following way:
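As a generic sketch of the linear scoring referred to here, with hypothesis h, reference r, feature functions φ_i and learned weights w_i (the concrete feature set is described in the cited BEER papers):

\mathrm{score}(h, r) \;=\; \sum_{i} w_i \, \phi_i(h, r) \;=\; \mathbf{w}^{\top} \boldsymbol{\phi}(h, r)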
[Figure: example permutation trees (PETs) — (a) a complex PET, (b) fully inverted PET 1, (c) fully inverted PET 2.]
tuning metric   BLEU   MTR    BEER   Length
BEER            16.4   28.4   10.2   115.7
BLEU            18.2   28.1   10.1   103.0
BEER no bias    18.0   27.7    9.8    99.7

Table 1: Tuning results with BEER without bias, on WMT14 as tuning and WMT13 as test set.

… correlation with human judgment, but when used for tuning they will create overly long translations. This bias for long translations is often resolved by manually setting the weights of recall and precision to be equal (Denkowski and Lavie, 2011; He and Way, 2009).

This problem is even bigger with metrics with many features. When we have a metric like BEER Treepel, which has 117 features, it is not clear how to set the weights for each feature manually. Also, some features might not have an easy interpretation as precision or recall of something. Our method for automatically removing this recall bias, which is presented in (Stanojević, 2015), gives very good results that can be seen in Table 1.

Before the automatic adaptation of weights for tuning, tuning with standard BEER produces translations that are 15% longer than the reference translations. This behavior is rewarded by metrics that are recall-heavy, like METEOR and BEER, and punished by precision-heavy metrics like BLEU. After automatic adaptation of weights, tuning with BEER matches the length of the reference translation even better than BLEU and achieves a BLEU score that is very close to tuning with BLEU. This kind of model is disliked by METEOR and BEER, but just by looking at the length of the produced translations it is clear which approach is preferred.

6 Metric and Tuning task results

The results of the WMT15 metrics task for the best performing metrics are shown in Tables 2 and 3 for the system level and Tables 4 and 5 for the segment level.

On the sentence level, for out-of-English language pairs, BEER was on average the best metric (same as last year). Into English it got 2nd place with its syntactic version and 4th place as the original BEER.

On the corpus level, BEER is on average second for out-of-English language pairs and 6th for into English. BEER and BEER Treepel are the best for en-ru and fi-en.

The difference between BEER and BEER Treepel is relatively big for de-en, cs-en and ru-en, while for fr-en and fi-en the difference does not seem to be big.

The results of the WMT15 tuning task are shown in Table 6. The system tuned with BEER without recall bias was the best submitted system for Czech-English, and only the strong baseline outperformed it.

System Name         TrueSkill (Tuning-Only)   TrueSkill (All)   BLEU
BLEU-MIRA-DENSE      0.153                    -0.177            12.28
ILLC-UVA             0.108                    -0.188            12.05
BLEU-MERT-DENSE      0.087                    -0.200            12.11
AFRL                 0.070                    -0.205            12.20
USAAR-TUNA           0.011                    -0.220            12.16
DCU                 -0.027                    -0.256            11.44
METEOR-CMU          -0.101                    -0.286            10.88
BLEU-MIRA-SPARSE    -0.150                    -0.331            10.84
HKUST               -0.150                    -0.331            10.99
HKUST-LATE           —                        —                 12.20

Table 6: Results on Czech-English tuning.

7 Conclusion

We have presented the ILLC UvA submission to the shared metric and tuning tasks. All submissions are centered around the BEER evaluation metric. On the metrics task we kept the good results we had on the sentence level and extended our metric to the corpus level with high correlation with human judgment, without losing the decomposability of the metric to the sentence level. Integration of syntactic features gave a bit of improvement on some language pairs. The removal of the recall bias allowed us to go from overly long translations produced in tuning to translations that match the reference relatively closely in length, and won the 3rd place in the tuning task. BEER is available at https://github.com/stanojevic/beer.

Acknowledgments

This work is supported by STW grant nr. 12271 and NWO VICI grant nr. 277-89-002. QT21 project support to the second author is also acknowledged (European Union’s Horizon 2020 grant agreement no. 645452). We are thankful to Christos Louizos for help with incorporating a dependency parser into BEER Treepel.
Table 2: System-level correlations of automatic evaluation metrics and the official WMT human scores
when translating into English.
Table 3: System-level correlations of automatic evaluation metrics and the official WMT human scores
when translating out of English.
Table 4: Segment-level Kendall’s τ correlations of automatic evaluation metrics and the official WMT
human judgments when translating into English. The last three columns contain average Kendall’s τ
computed by other variants.
Table 5: Segment-level Kendall’s τ correlations of automatic evaluation metrics and the official WMT
human judgments when translating out of English. The last three columns contain average Kendall’s τ
computed by other variants.
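For orientation, the segment-level Kendall’s τ reported in these tables is, in the WMT metrics-task formulation, computed from pairwise human preference judgments; as a generic sketch, with Conc the metric-concordant pairs and Disc the discordant ones:

\tau \;=\; \frac{|\mathrm{Conc}| - |\mathrm{Disc}|}{|\mathrm{Conc}| + |\mathrm{Disc}|}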
Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT ’11, pages 85–91, Stroudsburg, PA, USA. Association for Computational Linguistics.

Y. He and A. Way. 2009. Improving the objective function in minimum error rate training. Proceedings of the Twelfth Machine Translation Summit, pages 238–245.

Ralf Herbrich, Thore Graepel, and Klaus Obermayer. 1999. Support Vector Learning for Ordinal Regression. In International Conference on Artificial Neural Networks, pages 97–102.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic Evaluation of Translation Quality for Distant Language Pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 944–952, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matous Machacek and Ondrej Bojar. 2014. Results of the WMT14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293–301, Baltimore, Maryland, USA, June. Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the ACL 2014 Workshop on Statistical Machine Translation.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Maja Popović and Hermann Ney. 2009. Syntax-oriented evaluation measures for machine translation output. In Proceedings of the Fourth Workshop on Statistical Machine Translation, StatMT ’09, pages 29–32, Stroudsburg, PA, USA. Association for Computational Linguistics.

Miloš Stanojević and Khalil Sima’an. 2014c. Fitting Sentence Level Translation Evaluation with Many Dense Features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 202–206, Doha, Qatar, October. Association for Computational Linguistics.

Miloš Stanojević. 2015. Removing Biases from Trainable MT Metrics by Using Self-Training. arXiv preprint arXiv:1508.02445.

Hui Yu, Xiaofeng Wu, Jun Xie, Wenbin Jiang, Qun Liu, and Shouxun Lin. 2014. RED: A reference dependency based MT evaluation metric. In COLING’14, pages 2042–2051.

Hao Zhang and Daniel Gildea. 2007. Factorization of synchronous context-free grammars in linear time. In NAACL Workshop on Syntax and Structure in Statistical Translation (SSST).
1 http://www.statmt.org/wmt16/tuning-task/
2 All our experiments optimize the default BLEU but other metrics could be directly tested as well.
The process of finding the best translation e∗ is called decoding. The translations can vary significantly based on the values of the weights, therefore it is necessary to find the weights that would give the best result. This is achieved by minimizing the error of the machine translation against the human translation:

    λ∗ = argmin_λ err_f(g_p(λ), e_human)    (2)

The error function can also be considered as the negative value of an automated scorer. The problem with this straightforward approach is that decoding is computationally expensive. To reduce this cost, the decoder is not run for every considered weight setting. Instead, only some promising settings are tested in a loop (called the "outer loop"): given the current best weights, the decoder is asked to produce the n best translations for each sentence of the tuning set. This enlarged set of candidates allows us to estimate translation scores for similar weight settings. An optimizer uses these estimates to propose a new vector of weights and the decoder then tests this proposal in another outer loop. The outer loop is stopped when no new weight setting is proposed by the optimizer or no new translations are found by the decoder. The run of the optimizer is called the "inner loop", although it need not be iterative in any sense. The optimizer tries to find the best weights so that the least erroneous translations appear as high as possible in the n-best lists of candidate translations.

Our algorithm replaces the inner loop of MERT. It is therefore important to describe the properties of the inner loop optimization task.

Due to the finite number of translations accumulated in the n-best lists (across sentences as well as outer loop iterations), the error function changes only when the change in weights leads to a change in the order of the n-best list. This is represented by numerous plateaus in the error function with discontinuities on the edges of the plateaus. This prevents the use of simple gradient methods. We can define a local optimum not in a strict mathematical sense but as a plateau which has only higher or only lower plateaus at the edges. These local optima can then be numerous within the search space and trap any optimizing algorithm, thus preventing convergence to the desired global optimum.

Another problem is the relatively high dimensionality of the search space. The Tuning Task model has 21 features, but adding sparse features, we can get to thousands of dimensions.

These properties of the search space make PSO an interesting candidate for the inner loop algorithm. PSO is stochastic, so it does not require smoothness of the optimized function. It is also highly parallelizable and gains more power with more CPUs available, which is welcome since the optimization itself is quite expensive. The simplicity of PSO also leaves space for various improvements.

3 PSO Algorithm

The PSO algorithm was first described by Eberhart et al. (1995). PSO is an iterative optimization method inspired by the behavior of groups of animals such as flocks of birds or schools of fish. The space is searched by individual particles with their own positions and velocities. The particles can inform others of their current and previous positions and their properties.

3.1 TPSO

The original algorithm is defined quite generally. Let us formally introduce the procedure. The search space S is defined as

    S = [min_1, max_1] × · · · × [min_D, max_D]    (3)

where D is the dimension of the space and min_d and max_d are the minimal and maximal values for the d-th coordinate. We try to find a point in the space which maximizes a given function f : S → ℝ.

There are p particles, and the i-th particle in the n-th iteration has the following D-dimensional vectors: position x_i^n, velocity v_i^n, and two vectors of maxima found so far: the best position p_i^n visited by the particle itself and the best known position l_i^n that the particle has learned about from others.

In the TPSO algorithm, the l_i^n vector is always the globally best position visited by any particle so far.

The TPSO algorithm starts with a simple initialization of the particle positions and velocities.
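As a rough illustration of the kind of search described here, the following is a minimal global-best PSO sketch in Python. The inertia and acceleration coefficients are common textbook defaults and the toy objective merely stands in for the negated tuning-set error over the accumulated n-best lists; none of these specifics come from the actual submission.

```python
import random

def pso_maximize(f, bounds, n_particles=16, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Maximize f over a box-shaped search space S = [min_1,max_1] x ... x [min_D,max_D]."""
    rng = random.Random(seed)
    dim = len(bounds)

    # Initialization: random positions inside the box, small random velocities.
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vs = [[rng.uniform(-(hi - lo), hi - lo) * 0.1 for lo, hi in bounds]
          for _ in range(n_particles)]
    pbest = [list(x) for x in xs]                      # best position seen by each particle
    pbest_val = [f(x) for x in xs]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = list(pbest[g]), pbest_val[g]    # globally best position so far

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia plus pull towards personal and global best.
                vs[i][d] = (inertia * vs[i][d]
                            + c_personal * r1 * (pbest[i][d] - xs[i][d])
                            + c_global * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
                # Keep the particle inside the search space.
                lo, hi = bounds[d]
                xs[i][d] = min(max(xs[i][d], lo), hi)
            val = f(xs[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = list(xs[i]), val
                if val > gbest_val:
                    gbest, gbest_val = list(xs[i]), val
    return gbest, gbest_val

# Toy usage: maximize a smooth 2-D function. In MERT tuning the objective would instead
# be the negated error obtained by re-ranking the accumulated n-best lists with lambda.
best_x, best_val = pso_maximize(lambda x: -(x[0] - 1) ** 2 - (x[1] + 2) ** 2,
                                bounds=[(-5, 5), (-5, 5)])
print(best_x, best_val)
```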
Table 1: The final best BLEU score after the runs of the inner loop, for PSO without and with the termination condition (with 16 and 64 threads, respectively) and for the standard Moses MERT implementation with 16 threads.
Table 2: Average run times and reached scores. The ± are standard deviations.
n-best lists as accumulated in iterations 1 and 3 of the outer loop. Note that PSO and PSO-T use only as many particles as there are threads, so running them with just one thread leads to a degraded performance in terms of BLEU. With 4 or 8 threads, the three methods are on par in terms of tuning-set BLEU. Starting from 4 threads, both PSO and PSO-T terminate faster than the baseline MERT implementation. Moreover, the baseline MERT proved unable to utilize multiple CPUs efficiently, whereas PSO gives us up to a 14-fold speedup.

In general, the higher the ratio of the serial data loading to the search computation time, the worse the speedup. The search in PSO-T takes a much shorter time, so the overhead of serial data loading is more apparent and PSO-T appears badly parallelized, giving only a quadruple speedup. The reduction of this overhead is highly desirable.

6 Conclusion

We presented our submission to the WMT16 Tuning Task, a variant of particle swarm optimization applied to minimum error-rate training in statistical machine translation. Our method is a drop-in replacement of the standard Moses MERT and has the benefit of easy parallelization. Preliminary experiments suggest that it indeed runs faster and delivers comparable weight settings.

The effects on the number of iterations of the MERT outer loop as well as on the test-set performance have still to be investigated.

Acknowledgments

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 645452 (QT21). This work has been using language resources stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071). Computational resources were supplied by the Ministry of Education, Youth and Sports of the Czech Republic under the Projects CESNET (Project No. LM2015042) and CERIT-Scientific Cloud (Project No. LM2015085) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

References

Nicola Bertoldi, Barry Haddow, and Jean-Baptiste Fouet. 2009. Improved minimum error rate training in Moses. The Prague Bulletin of Mathematical Linguistics, 91:7–16.

Mohammad Reza Bonyadi and Zbigniew Michalewicz. 2014. SPSO 2011: Analysis of stability, local convergence, and rotation sensitivity. In Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, pages 9–16. ACM.

Maurice Clerc. 2012. Standard particle swarm optimisation.

Russ C. Eberhart, James Kennedy, et al. 1995. A new optimizer using particle swarm theory. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science, volume 1, pages 39–43. New York, NY.

George I. Evers and Mounir Ben Ghalia. 2009. Regrouping particle swarm optimization: a new global optimization algorithm with improved performance consistency across benchmarks. In Proceedings of the 2009 IEEE International Conference on Systems, Man and Cybernetics (SMC 2009), pages 3901–3908. IEEE.
ments. The bag-of-words input features f_BoW,i can be seen as normalized n-of-N vectors as demonstrated in Figure 1, where n is the number of words inside each bag-of-words.

[Figure 1: The bag-of-words input features along with the original word features (one-hot word vectors next to n-of-N bag-of-words vectors). The input vectors are projected and concatenated at the projection layer. We omit the hidden and output layers for simplification, since they remain unchanged.]

contextual words from the current word. Therefore the bag-of-words vector with decay weights can be defined as follows:

    f̃_BoW,i = Σ_{k ∈ S_BoW} d^{|i−k|} · f̃_k    (2)

where

    i, k        positions of the current word and of the words within the BoW model, respectively.
    f̃_BoW,i    the value vector of the BoW input feature for the i-th word in the sentence.
    f̃_k        the one-hot encoded feature vector of the k-th word in the sentence.
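A small NumPy sketch of Equation 2 for a single position follows; the vocabulary indexing and the fixed decay rate d = 0.9 are illustrative assumptions, not the trained corpus-, bag-, or word-level decay rates discussed below.

```python
import numpy as np

def decayed_bow_feature(word_ids, i, bow_positions, vocab_size, decay=0.9):
    """f~_BoW,i = sum_{k in S_BoW} d^{|i-k|} * one_hot(word_ids[k])   (Equation 2)."""
    f = np.zeros(vocab_size)
    for k in bow_positions:                      # S_BoW: positions outside the context window
        f[word_ids[k]] += decay ** abs(i - k)    # weight decays with distance to position i
    return f

# Toy usage: sentence "friends had been talking about this fish for a long time",
# current word "fish" (i=6), bag-of-words over the words outside a 5-word window.
sentence = "friends had been talking about this fish for a long time".split()
vocab = {w: idx for idx, w in enumerate(sorted(set(sentence)))}
ids = [vocab[w] for w in sentence]
outside = [0, 1, 2, 3, 9, 10]                    # friends, had, been, talking, long, time
print(decayed_bow_feature(ids, i=6, bow_positions=outside, vocab_size=len(vocab)))
```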
other source words outside the context window: {friends, had, been, talking} and {long, time}. Furthermore, there are multiple choices for assigning decay weights to all these words in the bag-of-words feature:

    Sentence: friends had been talking about this fish for a long time
    Weights:  d_fish^6, d_fish^5, d_fish^4, d_fish^3 for friends, had, been, talking, and d_fish^3, d_fish^4 for long, time

• Enhanced low frequency counts (Chen et al., 2011)
• 4-gram language model
• 7-gram word class language model (Wuebker et al., 2013)
• Projection layer size 100 for each word
Table 1: Experimental results of translations using exponentially decaying bag-of-words models with different kinds of decay rates. Improvements by systems marked with ∗ are statistically significant at the 95% level with respect to the baseline system, whereas † denotes statistically significant improvements at the 95% level with respect to the BoW Features system (without decay weights). We experimented with several values for the fixed decay rate (DR) and 0.9 performed best. The applied RNN model is the LSTM bidirectional translation model proposed in (Sundermeyer et al., 2014).
translation tasks. Here we applied two bag-of-words models to separately contain the preceding and succeeding words outside the context window. We can see that the bag-of-words feature without exponential decay weights only provides small improvements. After appending the decay weights, four different kinds of decay rates provide further improvements to varying degrees. The bag-of-words individual decay rate performs best, which gives us improvements of up to 0.5% in BLEU and up to 0.6% in TER. On these tasks, these improvements even help the feed-forward neural network achieve a similar performance to the popular long short-term memory recurrent neural network model (Sundermeyer et al., 2014), which contains three LSTM layers with 200 nodes each. The results of the word individual decay rate are worse than those of the bag-of-words decay rate. One reason is that in the word-individual case, the sequence order can still be missing. We initialize all values for the tunable decay rates with 0.9. In the IWSLT 2013 German→English task, the corpus decay rate is tuned to 0.578. When investigating the values of the trained bag-of-words individual decay rate vector, we noticed that the variance of the value for frequent words is much lower than for rare words. We also observed that most function words, such as prepositions and conjunctions, are assigned low decay rates. We could not find a pattern for the trained value vector of the word individual decay rates.

3.3 Comparison between Bag-of-Words and Large Context Window

The main motivation behind the usage of the bag-of-words input features is to provide the model with additional context information. We compared the bag-of-words input features to different source side windows to refute the argument that simply increasing the size of the window could achieve the same results. Our experiments showed that increasing the source side window beyond 11 gave no more improvements, while the model that used the bag-of-words input features is able to achieve the best result (Figure 2). A possible explanation for this could be that the feed-forward neural network learns its input position-dependently. If one source word is moved by one position, the feed-forward neural network needs to have seen a word with a similar word vector at this position during training to interpret it correctly. The likelihood of precisely getting the position decreases with a larger distance. The bag-of-words model, on the other hand, will still get the same input, only slightly stronger or weaker depending on the new distance and the decay rate.

4 Conclusion

The aim of this work was to investigate the influence of exponentially decaying bag-of-words input features with trained decay rates on the feed-forward neural network translation model. Applying the standard bag-of-words model as an additional input feature in our feed-forward neural network translation model only yields slight improvements.
[Figure 2 (plot): BLEU scores on the eval set for varying source-side window sizes, compared with a "5 words + 2 BoWs" setup.]

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21).

References

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc'Aurelio Ranzato. 2015. Learning longer memory in recurrent neural networks. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, May.
Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19–51, March.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Philadelphia, PA, USA, July.

Holger Schwenk, Daniel Déchelotte, and Jean-Luc Gauvain. 2006. Continuous space language models for statistical machine translation. In Proceedings of the 44th Annual Meeting of the International Committee on Computational Linguistics and the Association for Computational Linguistics, pages 723–730, Sydney, Australia, July.

Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Proceedings of the 24th International Conference on Computational Linguistics, pages 1071–1080, Mumbai, India, December.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Conference of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, MA, USA, August.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014. Translation modeling with bidirectional recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 14–25, Doha, Qatar, October.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, WA, USA, October.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 262–270, Uppsala, Sweden, July.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation. In International Conference on Computational Linguistics, pages 483–491, Mumbai, India, December.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving Statistical Machine Translation with Word Class Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1377–1381, Seattle, WA, USA, October.
a common and important question is which label vocabulary maximizes the translation quality. Bisazza and Monz (2014) compare class-based language models with diverse kinds of labels in terms of their performance in translation into morphologically rich languages. To the best of our knowledge, there is no published work on systematic comparison between different label vocabularies, model forms, and training data size for smoothing phrase translation models—the most basic component in state-of-the-art SMT systems. Our work fulfills these needs with extensive translation experiments (Section 5) and quantitative analysis (Section 6) in a standard phrase-based SMT framework.

3 Word Classes

In this work, we mainly use unsupervised word classes by Brown et al. (1992) as the reduced vocabulary. This section briefly reviews the principle and properties of word classes.

A word-class mapping c is estimated by a clustering algorithm that maximizes the following objective (Brown et al., 1992):

    L := Σ_{e_1^I} Σ_{i=1}^{I} p(c(e_i) | c(e_{i−1})) · p(e_i | c(e_i))    (1)

for a given monolingual corpus {e_1^I}, where each e_1^I is a sentence of length I in the corpus. The objective guides c to prefer certain collocations of class sequences, e.g. an auxiliary verb class should succeed a class of pronouns or person names. Consequently, the resulting c groups words according to their syntactic or semantic similarity.

Word classes have a big advantage for our comparative study: The structure and size of the class vocabulary can be arbitrarily adjusted by the clustering parameters. This makes it possible to easily prepare an abundant set of label vocabularies that differ in linguistic coherence and degree of generalization.

where N is the count of a phrase or a phrase pair in the training data. These counts are very low for many phrases due to a limited amount of bilingual training data.

Using a smaller vocabulary, we can aggregate the low counts and make the distribution smoother. We now define two types of smoothing models for Equation 2 using a general word-label mapping c.

4.1 Mapping All Words at Once (map-all)

For the phrase translation model, the simplest formulation of vocabulary reduction is obtained by replacing all words in the source and target phrases with the corresponding labels in a smaller space. Namely, we employ the following probability instead of Equation 2:

    p_all(f̃ | ẽ) = N(c(f̃), c(ẽ)) / N(c(ẽ))    (3)

which we call map-all. This model resembles the word class translation model of Wuebker et al. (2013), except that we allow any kind of word-level labels.

This model generalizes all words of a phrase without distinction between them. Also, the same formulation is applied to word-based lexicon models.

4.2 Mapping Each Word at a Time (map-each)

More elaborate smoothing can be achieved by generalizing only a sub-part of the phrase pair. The idea is to replace one source word at a time with its respective label. For each source position j, we also replace the target words aligned to the source word f_j. For this purpose, we let a_j ⊆ {1, ..., |ẽ|} denote the set of target positions aligned to j. The resulting model takes a weighted average of the redefined translation probabilities over all source positions of f̃:

    p_each(f̃ | ẽ) = Σ_{j=1}^{|f̃|} w_j · N(c^(j)(f̃), c^(a_j)(ẽ)) / N(c^(a_j)(ẽ))    (4)
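A rough sketch of both smoothing models over relative-frequency counts follows; the label map, counts, phrases, and alignment below are made up for illustration, and the count-proportional weights anticipate the weighting reported as best with the worked example that follows.

```python
from collections import Counter

def map_phrase(phrase, c, positions=None):
    """Replace words by their labels; if positions is given, only at those indices."""
    if positions is None:
        return tuple(c.get(w, w) for w in phrase)                  # map-all
    return tuple(c.get(w, w) if i in positions else w for i, w in enumerate(phrase))

def p_all(f, e, c, pair_counts, tgt_counts):
    """Equation 3: p_all(f|e) = N(c(f), c(e)) / N(c(e))."""
    cf, ce = map_phrase(f, c), map_phrase(e, c)
    return pair_counts[(cf, ce)] / tgt_counts[ce]

def p_each(f, e, alignment, c, pair_counts, tgt_counts):
    """Equation 4 with count-proportional weights w_j (cf. the weight definition below)."""
    terms = []
    for j in range(len(f)):
        a_j = {i for (jj, i) in alignment if jj == j}              # target positions aligned to j
        cf = map_phrase(f, c, positions={j})
        ce = map_phrase(e, c, positions=a_j)
        n = pair_counts[(cf, ce)]
        terms.append((n, n / tgt_counts[ce] if tgt_counts[ce] else 0.0))
    z = sum(n for n, _ in terms) or 1.0
    return sum((n / z) * prob for n, prob in terms)                # w_j = n_j / sum_j' n_j'

# Toy usage with hypothetical counts and a hypothetical label map.
c = {"das": "C1", "haus": "C2", "the": "D1", "house": "D2"}
f, e = ("das", "haus"), ("the", "house")
alignment = {(0, 0), (1, 1)}                                       # pairs (source j, target i)
pair_counts = Counter({(("C1", "haus"), ("D1", "house")): 3,
                       (("das", "C2"), ("the", "D2")): 2,
                       (("C1", "C2"), ("D1", "D2")): 5})
tgt_counts = Counter({("D1", "house"): 4, ("the", "D2"): 3, ("D1", "D2"): 6})
print(p_all(f, e, c, pair_counts, tgt_counts))
print(p_each(f, e, alignment, c, pair_counts, tgt_counts))
```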
[Figure 1: Word alignments of a pair of three-word phrases (f1–e1; f3–e2, e3; f2 unaligned).]

    p_each([f1, f2, f3] | [e1, e2, e3]) =
          w1 · N([c(f1), f2, f3], [c(e1), e2, e3]) / N([c(e1), e2, e3])
        + w2 · N([f1, c(f2), f3], [e1, e2, e3]) / N([e1, e2, e3])
        + w3 · N([f1, f2, c(f3)], [e1, c(e2), c(e3)]) / N([e1, c(e2), c(e3)])

where the alignments are depicted by line segments.

First of all, we replace f1 and also e1, which is aligned to f1, with their corresponding labels. As f2 has no alignment points, we do not replace any target word accordingly. f3 triggers the class replacement of two target words at the same time. Note that the model implicitly encapsulates the alignment information.

We empirically found that the map-each model performs best with the following weight:

    w_j = N(c^(j)(f̃), c^(a_j)(ẽ)) / Σ_{j′=1}^{|f̃|} N(c^(j′)(f̃), c^(a_j′)(ẽ))    (5)

5 Experiments

5.1 Setup

We evaluate how much the translation quality is improved by the smoothing models in Section 4. The two smoothing models are trained in both source-to-target and target-to-source directions, and integrated as additional features in the log-linear combination of a standard phrase-based SMT system (Koehn et al., 2003). We also test linear interpolation between the standard and smoothing models, but the results are generally worse than log-linear interpolation. Note that vocabulary reduction models by themselves cannot replace the corresponding standard models, since this leads to a considerable drop in translation quality (Wuebker et al., 2013).

Our baseline systems include phrase translation models in both directions, word-based lexicon models in both directions, word/phrase penalties, a distortion penalty, a hierarchical lexicalized reordering model (Galley and Manning, 2008), a 4-gram language model, and a 7-gram word class language model (Wuebker et al., 2013). The model weights are trained with minimum error rate training (Och, 2003). All experiments are conducted with an open source phrase-based SMT toolkit Jane 2 (Wuebker et al., 2012).

To validate our experimental results, we measure the statistical significance using the paired bootstrap resampling method of Koehn (2004). Every result in this section is marked with ‡ if it is statistically significantly better than the baseline with 95% confidence, or with † for 90% confidence.
We carry out comparative experiments regarding the three factors of the clustering algorithm:

[Figure 2 (plot): BLEU scores for clustering iterations when using individually tuned model weights for each iteration, comparing map-each, map-all, and the baseline. Dots indicate those iterations in which the translation is performed.]

The score does not consistently increase or decrease over the iterations; it is rather on a similar level (± 0.2% BLEU) for all settings with slight fluctuations. This is an important clue that the whole process of word clustering has no meaning in smoothing phrase translation models.

To see this more clearly, we keep the model weights fixed over different systems and run the same set of experiments. In this way, we focus only on the change of label vocabulary, removing the impact of nondeterministic model weight optimization. The results are given in Figure 3.

This time, the curves are even flatter, resulting in only ± 0.1% BLEU difference over the iterations. More surprisingly, the models trained with the initial clustering, i.e. when the clustering algorithm has not even started yet, are on a par with those trained with more optimized classes in terms of translation quality.

Baseline mappings:

• random: randomly assign words to classes
• top-frequent (default): top-frequent words have their own classes, while all other words are in the last class
• same-countsum: each class has almost the same sum of word unigram counts
• same-#words: each class has almost the same number of words
• count-bins: each class represents a bin of the total count range

    Initialization               BLEU [%]   TER [%]
    Baseline                     28.3       52.2
    + map-each   random          28.9‡      51.7‡
                 top-frequent    29.0‡      51.5‡
                 same-countsum   28.8‡      51.7‡
                 same-#words     28.9‡      51.6‡
                 count-bins      29.0‡      51.4‡

Table 1: Translation results for various initializations of the clustering. 100 classes on both sides.

Table 1 shows the translation results with the map-each model trained with these initializations—without running the clustering algorithm. We use the same set of model weights used in Figure 3. We find that the initialization method also does not affect the translation performance. As an extreme case,
random clustering is also a fine candidate for training the map-each model.

3) Number of classes. This determines the vocabulary size of a label space, which eventually adjusts the smoothing degree. Table 2 shows the translation performance of the map-each model with a varying number of classes. Similarly as before, there is no serious performance gap among different word classes, and POS tags and lemmas also conform to this trend.

However, we observe a slight but steady degradation of translation quality (≈ -0.2% BLEU) when the vocabulary size is larger than a few hundred. We also lose statistical significance for BLEU in these cases. The reason could be: if the label space becomes larger, it gets closer to the original vocabulary and therefore the smoothing model provides less additional information to add to the standard phrase translation model.

[Table 2: translation performance of the map-each model with a varying number of word classes]

    #vocab (source)           BLEU [%]   TER [%]
    Baseline                  28.3       52.2
    + map-each     100        29.0‡      51.5‡
    (word class)   200        28.9†      51.6‡
                   500        28.7       51.8‡
                   1000       28.7       51.8‡

The series of experiments show that the map-each model performs very similarly across vocabulary size and its structure. From our internal experiments, this argument also holds for the map-all model. The results do not change even when we use a different clustering algorithm, e.g. bilingual clustering (Och, 1999). For the translation performance, the more important factor is the log-linear model training to find an optimal set of weights for the smoothing models.

5.3 Comparison of Smoothing Models

Next, we compare the two smoothing models by their performance in four different translation tasks: IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech. We train 100 classes on each side with 30 clustering iterations starting from the default (top-frequent) initialization.

Table 3 provides the corpus statistics of all datasets used. Note that a morphologically rich language is on the source side for the first two tasks, and on the target side for the last two tasks. According to the results (Table 4), the map-each model, which encourages backing off infrequent words, performs consistently better (maximum +0.5% BLEU, -0.6% TER) than the map-all model in all cases.

5.4 Comparison of Training Data Size

Lastly, we analyze the smoothing performance for different training data sizes (Figure 4). The improvement of BLEU score over the baseline decreases drastically when the training data get smaller. We argue that this is because the smoothing models are only additional scores for the phrases seen in the training data. For smaller training data, we have more out-of-vocabulary (OOV) words in the test set, which cannot be handled by the presented models.

[Figure 4 (plot): BLEU scores and OOV rates for the varying training data portion of WMT 2015 Finnish→English data.]

6 Analysis

In Section 5.2, we have shown experimentally that more optimized or more fine-grained classes do not guarantee better smoothing performance. We now verify by examining translation outputs that
Table 3: Bilingual training data statistics for IWSLT 2012 German→English, WMT 2015
Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech tasks.
Table 4: Translation results for IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT
2014 English→German, and WMT 2015 English→Czech tasks.
Table 5: Comparison of translation outputs for the smoothing models with different vocabularies. “op-
timized” denotes 30 iterations of the clustering algorithm, whereas “non-optimized” means the initial
(default) clustering.
the same level of performance is not by chance but due to similar hypothesis scoring across different systems.

Given a test set, we compare its translations generated from different systems as follows. First, for each translated set, we sort the sentences by how much the sentence-level TER is improved over the baseline translation. Then, we select the top 200 sentences from this sorted list, which represent the main contribution to the decrease of TER. In Table 5, we compare the top 200 TER-improved translations of the map-each model setups with different vocabularies.

In the fourth column, we trace the input sentences that are translated by the top 200 lists, and count how many of those inputs are overlapped across given systems. Here, a large overlap indicates that two systems are particularly effective in a large common part of the test set, showing that they behaved analogously in the search process. The numbers in this column are computed against the map-each model setup trained with 100 optimized word classes (first row). For all map-each settings, the overlap is very large—around 90%.

To investigate further, we count how often the two translations of a single input are identical (the last column). This is normalized by the number of common input sentences in the top 200 lists between two systems. It is a straightforward measure to see if two systems discriminate translation hypotheses in a similar manner. Remarkably, all systems equipped with the map-each model produce exactly the same translations for the most part of the top 200 TER-improved sentences.
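A small sketch of this comparison for two systems, assuming an external sentence_ter(hypothesis, reference) helper is available; the function names and data layout are illustrative rather than the authors' actual tooling.

```python
def top_ter_improved(baseline, system, refs, sentence_ter, k=200):
    """Indices of the k sentences where the system improves sentence-level TER the most."""
    gains = [(sentence_ter(b, r) - sentence_ter(h, r), idx)
             for idx, (b, h, r) in enumerate(zip(baseline, system, refs))]
    return [idx for _, idx in sorted(gains, reverse=True)[:k]]

def compare_systems(baseline, sys_a, sys_b, refs, sentence_ter, k=200):
    """Overlap of top-k TER-improved inputs and share of identical translations on them."""
    top_a = set(top_ter_improved(baseline, sys_a, refs, sentence_ter, k))
    top_b = set(top_ter_improved(baseline, sys_b, refs, sentence_ter, k))
    common = top_a & top_b
    identical = sum(1 for i in common if sys_a[i] == sys_b[i])
    return len(common) / k, (identical / len(common) if common else 0.0)
```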
We can see from this analysis that, even though a smoothing model is trained with essentially different vocabularies, it helps the translation process in basically the same manner. For comparison, we also compute the measures for a map-all model, which are far behind the high similarity among the map-each models. Indeed, for smoothing phrase translation models, changing the model structure for vocabulary reduction exerts a strong influence on the hypothesis scoring, yet changing the vocabulary does not.

7 Conclusion

Reducing vocabulary using word-label mapping is a simple and effective way of smoothing phrase translation models. By mapping each word in a phrase at a time, the translation quality can be improved by up to +0.7% BLEU and -0.8% TER over a standard phrase-based SMT baseline, which is superior to Wuebker et al. (2013).

Our extensive comparison among various vocabularies shows that different word-label mappings are almost equally effective for smoothing phrase translation models. This allows us to use any type of word-level label, e.g. a randomized vocabulary, for the smoothing, which saves a considerable amount of effort in optimizing the structure and granularity of the label vocabulary. Our analysis on sentence-level TER demonstrates that the same level of performance stems from the analogous hypothesis scoring.

We claim that this result emphasizes the fundamental sparsity of the standard phrase translation model. Too many target phrase candidates are originally undervalued, so giving them any reasonable amount of extra probability mass, e.g. by smoothing with random classes, is enough to broaden the search space and improve translation quality. Even if we change a single parameter in estimating the label space, it does not have a significant effect on scoring hypotheses, where many other models than the smoothed translation model, e.g. language models, are involved with large weights. Nevertheless, an exact linguistic explanation is still to be discovered.

Our results on varying training data show that vocabulary reduction is more suitable for large-scale translation setups. This implies that OOV handling is more crucial than smoothing phrase translation models for low-resource translation tasks.

For future work, we plan to perform a similar set of comparative experiments on neural machine translation systems.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21).

References

Arianna Bisazza and Christof Monz. 2014. Class-based language modeling for translating into morphologically rich languages. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1918–1927, Dublin, Ireland, August.

Rami Botros, Kazuki Irie, Martin Sundermeyer, and Hermann Ney. 2015. On efficient training of word classes and their application to recurrent neural network language models. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), pages 1443–1447, Dresden, Germany, September.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December.

Colin Cherry. 2013. Improved reordering for phrase-based translation using sparse features. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), pages 22–31, Atlanta, GA, USA, June.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the usefulness of generalized word representations in SMT. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 421–432, Dublin, Ireland, August.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 848–856, Honolulu, HI, USA, October.

Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU phrase-based machine translation systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation (WMT 2015), pages 126–133, Lisbon, Portugal, September.
Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 868–876, Prague, Czech Republic, June.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT 2003), pages 48–54, Edmonton, Canada, May.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 388–395, Barcelona, Spain, July.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open source phrase-based and hierarchical statistical machine translation. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 483–492, Mumbai, India, December.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving statistical machine translation with word class models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1377–1381, Seattle, USA, October.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-based statistical machine translation. In Matthias Jarke, Jana Koehler, and Gerhard Lakemeyer, editors, 25th German Conference on Artificial Intelligence (KI 2002), volume 2479 of Lecture Notes in Artificial Intelligence (LNAI), pages 18–32, Aachen, Germany, September. Springer Verlag.
One of the oldest and still most popular exchange algorithm implementations is mkcls (Och, 1995)¹, which adds various metaheuristics to escape local optima. Botros et al. (2015) introduce their implementation of three exchange-based algorithms. Martin et al. (1998) and Müller and Schütze (2015)² use trigrams within the exchange algorithm. Clark (2003) adds an orthotactic bias.³

The previous algorithms use an unlexicalized (two-sided) language model: P(w_i | w_{i−1}) = P(w_i | c_i) P(c_i | c_{i−1}), where the class c_i of the predicted word w_i is conditioned on the class c_{i−1} of the previous word w_{i−1}. Goodman (2001b) altered this model so that c_i is conditioned directly upon w_{i−1}, hence: P(w_i | w_{i−1}) = P(w_i | c_i) P(c_i | w_{i−1}). This new model fractionates the history more, but it [...] The resulting partially lexicalized (one-sided) class model [...]

[...] integration of these two models:

    P(w_i | w_{i−1}, w_{i+1}) := P(w_i | c_i) · (λ P(c_i | w_{i−1}) + (1 − λ) P(c_i | w_{i+1}))    (1)

The interpolation weight λ for the forward direction alternates to 1 − λ every a iterations (i):

    λ_i := 1 − λ_0  if i mod a = 0;  λ_0  otherwise    (2)

Figure 1 illustrates the benefit of this λ-inversion to help escape local minima, with lower training set perplexity by inverting λ every four iterations:

[Figure 1 (plot): training set perplexity of the predictive exchange algorithm over iterations, with and without λ-inversion.]
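A minimal sketch of Equations 1 and 2; the probability tables are placeholders (in the actual clusterer they would be derived from the word and class counts maintained by the exchange algorithm), so only the interpolation and the inversion schedule are meant literally.

```python
def lambda_at(iteration, lam0=0.7, a=4):
    """Equation 2: invert the interpolation weight every a-th iteration."""
    return 1.0 - lam0 if iteration % a == 0 else lam0

def p_bidirectional(w, prev_w, next_w, cls, p_w_given_c, p_c_given_prev, p_c_given_next, lam):
    """Equation 1: P(w | prev_w, next_w) = P(w|c) * (lam*P(c|prev_w) + (1-lam)*P(c|next_w))."""
    c = cls[w]
    return p_w_given_c[(w, c)] * (lam * p_c_given_prev[(c, prev_w)]
                                  + (1.0 - lam) * p_c_given_next[(c, next_w)])

# Illustrative numbers only: one word, one class, fabricated probabilities.
cls = {"cat": "NOUN"}
p_w_given_c = {("cat", "NOUN"): 0.2}
p_c_given_prev = {("NOUN", "the"): 0.6}
p_c_given_next = {("NOUN", "sat"): 0.5}
for it in range(1, 9):
    lam = lambda_at(it)
    print(it, lam, p_bidirectional("cat", "the", "sat", cls,
                                   p_w_given_c, p_c_given_prev, p_c_given_next, lam))
```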
[Figure 2 (plot): Development set PP of combinations of improvements to predictive exchange (cf. §3), using 100M tokens of the Russian News Crawl, with 800 word classes.]

any word to any cluster—there is no hard constraint that the more refined partitions be subsets of the initial coarser partitions. This gives more flexibility in optimizing on log-likelihood, especially given the noise that naturally arises from coarser clusterings. We explored cluster refinement over more stages than just two, successively increasing the number of clusters. We observed no improvement over the two-stage method described above.

Each BIRA component can be applied to any exchange-based clusterer. The contributions of each of these are shown in Figure 2, which reports the development set perplexities (PP) of all combinations of BIRA components over the original predictive exchange algorithm. The data and configurations are discussed in more detail in Section 4. The greatest PP reduction is due to using lambda inversion (+Rev), followed by cluster refinement (+Refine), then interpolating the bidirectional models (+BiDi), with robust improvements by using all three of these—an 18% reduction in perplexity over the predictive exchange algorithm. We have found that both lambda inversion and cluster refinement prevent early convergence at local optima, while bidirectional models give immediate and consistent training set PP improvements, but this is attenuated in a unidirectional evaluation.

We observed that most of the computation for the predictive exchange algorithm is spent on the logarithm function, calculating δ ← δ − N(w, c) · log N(w, c).⁷ Since the codomain of N(w, c) is ℕ₀, and due to the power law distribution of the algorithm's access to these entropy terms, we can precompute N · log N up to, say 10e+7, with minimal memory requirements.⁸ This results in a considerable speedup of around 40%.

4 Experiments

Our experiments consist of both intrinsic and extrinsic evaluations. The intrinsic evaluation measures the perplexity (PP) of two-sided class-based models for English and Russian, and the extrinsic evaluation measures BLEU scores of phrase-based MT of Russian↔English and Japanese↔English texts.

4.1 Class-based Language Model Evaluation

In this task we used 400, 800, and 1200 classes for English, and 800 classes for Russian. The data comes from the 2011–2013 News Crawl monolingual data of the WMT task.⁹ For these experiments the data was deduplicated, shuffled, tokenized, digit-conflated, and lowercased. In order to have a large test set, one line per 100 of the resulting (shuffled) corpus was separated into the test set.¹⁰ The minimum count threshold was set to 3 occurrences in the training set. Table 1 shows information on the resulting corpus.

    Corpus          Tokens   Types   Lines
    English Train   1B       2M      42M
    English Test    12M      197K    489K
    Russian Train   550M     2.7M    31M
    Russian Test    6M       284K    313K

Table 1: Monolingual training & test set sizes.

The clusterings are evaluated on the PP of an external 5-gram unidirectional two-sided class-based language model (LM). The n-gram-order interpolation weights are tuned using a distinct development set of comparable size and quality as the test set.

Table 2 and Figure 3 show perplexity results using a varying number of classes. Two-sided exchange gives the lowest perplexity across the board, although this is within a two-sided LM evaluation.

⁷ δ is the change in log-likelihood, and N(w, c) is the count of a given word followed by a given class.
⁸ This was independently discovered in Botros et al. (2015).
⁹ http://www.statmt.org/wmt15/translation-task.html
¹⁰ The data setup script is at http://www.dfki.de/~jode03/naacl2016.sh
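The N · log N precomputation described at the end of Section 3 can be sketched as follows; the table size here is deliberately small and the surrounding update is schematic, with made-up counts rather than the clusterer's real data structures.

```python
import math

TABLE_SIZE = 1_000_000            # the paper suggests precomputing up to roughly 10^7 entries
N_LOG_N = [0.0] * TABLE_SIZE
for n in range(2, TABLE_SIZE):
    N_LOG_N[n] = n * math.log(n)  # n = 0 and n = 1 contribute 0 to the entropy term

def n_log_n(n):
    """Return n*log(n), using the precomputed table for the (frequent) small counts."""
    return N_LOG_N[n] if n < TABLE_SIZE else n * math.log(n)

# Schematic use inside an exchange step: delta <- delta - N(w, c) * log N(w, c)
def update_delta(delta, count_w_c):
    return delta - n_log_n(count_w_c)

print(update_delta(0.0, 42))
```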
[Figure 3 (plot): perplexity of BIRA, Brown, 2-Sided Exchange, and Pred. Exchange clusterings.]

Table 3: Clustering times (hours) of full training sets. Mkcls implements two-sided exchange, and Phrasal implements one-sided predictive exchange.
is a substantial time bottleneck in MT pipelines with large datasets.

We used data from the Workshop on Machine Translation 2015 (WMT15) Russian↔English dataset and the Workshop on Asian Translation 2014 (WAT14) Japanese↔English dataset (Nakazawa et al., 2014). Both pairs used standard configurations, like truecasing, MeCab segmentation for Japanese, MGIZA alignment, grow-diag-final-and phrase extraction, phrase-based Moses, quantized KenLM 5-gram modified Kneser-Ney LMs, and MERT tuning.

    |C|     EN-RU          RU-EN          EN-JA          JA-EN
    10      20.8→20.9*     26.2→26.0      23.5→23.4      16.9→16.8
    50      21.0→21.2*     25.9→25.7      24.0→23.7*     16.9→16.9
    100     20.4→21.1      25.9→25.8      23.8→23.5      16.9→17.0
    200     21.0→20.8      25.8→25.9      23.8→23.4      17.0→16.8
    500     20.9→20.9      25.8→25.9*     24.0→23.8      16.8→17.1*
    1000    20.9→21.1      25.9→26.0**    23.6→23.5      16.9→17.1

Table 4: BLEU scores (mkcls→BIRA) and significance across cluster sizes (|C|).

The BLEU score differences between using mkcls and our BIRA implementation are small, but there are a few statistically significant changes, using bootstrap resampling (Koehn, 2004). Table 4 presents the BLEU score changes across varying cluster sizes (*: p-value < 0.05, **: p-value < 0.01). MERT tuning is quite erratic, and some of the BLEU differences could be affected by noise in the tuning process in obtaining quality weight values. Using our BIRA implementation reduces the translation model training time with 500 clusters from 20 hours using mkcls (of which 60% of the time is spent on clustering) to just 8 hours (of which 5% is spent on clustering).

5 Conclusion

We have presented improvements to the predictive exchange algorithm that address longstanding drawbacks of the original algorithm compared to other clustering algorithms, enabling new directions in using large scale, high cluster-size word classes in NLP.

Botros et al. (2015) found that the one-sided model of the predictive exchange algorithm produces better results for training LSTM-based language models compared to two-sided models, while two-sided models generally give better perplexity in class-based LM experiments. Our paper shows that BIRA-based predictive exchange clusters are competitive with two-sided clusters even in a two-sided evaluation. They also give better perplexity than the original predictive exchange algorithm and Brown clustering.

The software is freely available at https://github.com/jonsafari/clustercat .

Acknowledgements

We would like to thank Hermann Ney and Kazuki Irie, as well as the reviewers for their useful comments. This work was supported by the QT21 project (Horizon 2020 No. 645452).

References

Rami Botros, Kazuki Irie, Martin Sundermeyer, and Hermann Ney. 2015. On Efficient Training of Word Classes and their Application to Recurrent Neural Network Language Models. In Proceedings of INTERSPEECH-2015, pages 1443–1447, Dresden, Germany.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.

Marie Candito and Djamé Seddah. 2010. Parsing Word Clusters. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 76–84, Los Angeles, CA, USA.

Victor Chahuneau, Eva Schlinger, Noah A. Smith, and Chris Dyer. 2013. Translating into Morphologically Rich Languages with Synthetic Phrases. In Proceedings of EMNLP, pages 1677–1687, Seattle, WA, USA.

Colin Cherry. 2013. Improved Reordering for Phrase-Based Translation using Sparse Features. In Proceedings of NAACL-HLT, pages 22–31, Atlanta, GA, USA.

Alexander Clark. 2003. Combining Distributional and Morphological Information for Part of Speech Induction. In Proceedings of EACL, pages 59–66.

Leon Derczynski and Sean Chester. 2016. Generalised Brown Clustering and Roll-up Feature Generation. In Proceedings of AAAI, Phoenix, AZ, USA.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the Usefulness of Generalized Word Representations in SMT. In Proceedings of Coling, pages 421–432, Dublin, Ireland.
Joshua Goodman. 2001a. Classes for Fast Maximum Entropy Training. In Proceedings of ICASSP, pages 561–564.

Joshua T. Goodman. 2001b. A Bit of Progress in Language Modeling, Extended Version. Technical Report MSR-TR-2001-72, Microsoft Research.

Spence Green, Daniel Cer, and Christopher Manning. 2014. An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation. In Proceedings of WMT, pages 466–476, Baltimore, MD, USA.

Reinhard Kneser and Hermann Ney. 1993. Improved clustering techniques for class-based statistical language modelling. In Proceedings of EUROSPEECH'93, pages 973–976, Berlin, Germany.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of EMNLP-CoNLL, pages 868–876, Prague, Czech Republic.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, pages 388–395.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A Dependency Parser for Tweets. In Proceedings of EMNLP, pages 1001–1012, Doha, Qatar.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple Semi-supervised Dependency Parsing. In Proceedings of ACL: HLT, pages 595–603, Columbus, OH, USA.

Percy Liang. 2005. Semi-Supervised Learning for Natural Language. Master's thesis, MIT.

Sven Martin, Jörg Liermann, and Hermann Ney. 1998. Algorithms for Bigram and Trigram Word Clustering. Speech Communication, 24(1):19–37.

Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Workshop Proceedings of the International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA.

Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name Tagging with Word Clusters and Discriminative Training. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, Proceedings of HLT-NAACL, pages 337–342, Boston, MA, USA.

Andriy Mnih and Geoffrey Hinton. 2009. A Scalable Hierarchical Distributed Language Model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in NIPS-21, volume 21, pages 1081–1088.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In Proceedings of AISTATS, volume 5, pages 246–252.

Thomas Müller and Hinrich Schütze. 2015. Robust Morphological Tagging with Word Representations. In Proceedings of NAACL, pages 526–536, Denver, CO, USA.

Toshiaki Nakazawa, Hideya Mino, Isao Goto, Sadao Kurohashi, and Eiichiro Sumita. 2014. Overview of the First Workshop on Asian Translation. In Proceedings of the Workshop on Asian Translation (WAT).

Franz Josef Och and Hermann Ney. 2000. A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of Coling, pages 1086–1090, Saarbrücken, Germany.

Franz Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung. Bachelor's thesis (Studienarbeit), Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.

Slav Petrov. 2009. Coarse-to-Fine Natural Language Processing. Ph.D. thesis, University of California at Berkeley, Berkeley, CA, USA.

Lev Ratinov and Dan Roth. 2009. Design Challenges and Misconceptions in Named Entity Recognition. In Proceedings of CoNLL, pages 147–155, Boulder, CO, USA.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of EMNLP 2011, pages 1524–1534, Edinburgh, Scotland.

Attapol Rutherford and Nianwen Xue. 2014. Discovering Implicit Discourse Relations Through Brown Cluster Pair Representation and Coreference Patterns. In Proceedings of EACL, pages 645–654, Gothenburg, Sweden.

Sara Stymne. 2012. Clustered Word Classes for Preordering in Statistical Machine Translation. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, pages 28–34, Avignon, France.

Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure. In Proceedings of NAACL:HLT, pages 477–487, Montréal, Canada.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of ACL, pages 384–394, Uppsala, Sweden.

Jakob Uszkoreit and Thorsten Brants. 2008. Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation. In Proceedings of ACL: HLT, pages 755–762, Columbus, OH, USA.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving Statistical Machine Translation with Word Class Models. In Proceedings of EMNLP, pages 1377–1381, Seattle, WA, USA.

Andreas Zollmann and Stephan Vogel. 2011. A Word-Class Approach to Labeling PSCFG Rules for Machine Translation. In Proceedings of ACL-HLT, pages 1–11, Portland, OR, USA.