
Ref. Ares(2016)4040587 - 01/08/2016

This document is part of the Research and Innovation Action “Quality Translation 21 (QT21)”.
This project has received funding from the European Union’s Horizon 2020 program for ICT
under grant agreement no. 645452.

Deliverable D1.5

Improved Learning for Machine Translation

Ondřej Bojar (CUNI), Jan-Thorsten Peter (RWTH), Weiyue Wang (RWTH), Tamer Alkhouli (RWTH), Yunsu Kim (RWTH), Miloš Stanojević (UvA), Khalil Sima’an (UvA), and Jon Dehdari (DFKI)

Dissemination Level: Public

31st July, 2016



Grant agreement no. 645452


Project acronym QT21
Project full title Quality Translation 21
Type of action Research and Innovation Action
Coordinator Prof. Josef van Genabith (DFKI)
Start date, duration 1st February, 2015, 36 months
Dissemination level Public
Contractual date of delivery 31st July, 2016
Actual date of delivery 31st July, 2016
Deliverable number D1.5
Deliverable title Improved Learning for Machine Translation
Type Report
Status and version Final (Version 1.0)
Number of pages 78
Contributing partners CUNI, DFKI, RWTH, UVA
WP leader CUNI
Author(s) Ondřej Bojar (CUNI), Jan-Thorsten Peter (RWTH),
Weiyue Wang (RWTH), Tamer Alkhouli (RWTH),
Yunsu Kim (RWTH), Miloš Stanojević (UvA),
Khalil Sima’an (UvA), and Jon Dehdari (DFKI)
EC project officer Susan Fraser
The partners in QT21 are:
• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany
• Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany
• Universiteit van Amsterdam (UvA), Netherlands
• Dublin City University (DCU), Ireland
• University of Edinburgh (UEDIN), United Kingdom
• Karlsruher Institut für Technologie (KIT), Germany
• Centre National de la Recherche Scientifique (CNRS), France
• Univerzita Karlova v Praze (CUNI), Czech Republic
• Fondazione Bruno Kessler (FBK), Italy
• University of Sheffield (USFD), United Kingdom
• TAUS b.v. (TAUS), Netherlands
• text & form GmbH (TAF), Germany
• TILDE SIA (TILDE), Latvia
• Hong Kong University of Science and Technology (HKUST), Hong Kong

For copies of reports, updates on project activities and other QT21-related information, contact:
Prof. Stephan Busemann, DFKI GmbH
Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany
Email: stephan.busemann@dfki.de
Phone: +49 (681) 85775 5286
Fax: +49 (681) 85775 5338
Copies of reports and other material can also be accessed via the project’s homepage:
http://www.qt21.eu/

© 2016, The Individual Authors


No part of this document may be reproduced or transmitted in any form, or by any means,
electronic or mechanical, including photocopy, recording, or any information storage and retrieval
system, without permission from the copyright owner.


Contents
1 Executive Summary 4

2 Joint Translation and Reordering Sequences 5

3 Alignment-Based Neural Machine Translation 5

4 The QT21/HimL Combined Machine Translation System 6

5 Improved BEER 6

6 Particle Swarm Optimization for MERT 6

7 CharacTER: Translation Edit Rate on Character Level 6

8 Bag-of-Words Input Features for Neural Network 7

9 Vocabulary Reduction for Phrase Table Smoothing 7

10 Faster and Better Word Classes for Word Alignment 8

References 8

Appendices 10

Appendix A Joint Translation and Reordering Sequences 10

Appendix B Alignment-Based Neural Machine Translation 21

Appendix C The QT21/HimL Combined Machine Translation System 33

Appendix D Beer for MT evaluation and tuning 45

Appendix E Particle Swarm Optimization Submission for WMT16 Tuning Task 51

Appendix F CharacTER: Translation Edit Rate on Character Level 58

Appendix G Exponentially Decaying Bag-of-Words Input Features 58

Appendix H Study on Vocabulary Reduction for Phrase Table Smoothing 65

Appendix I BIRA: Improved Predictive Exchange Word Clustering 73


1 Executive Summary
This deliverable reports on the progress in Task 1.3 Improved Learning for Machine Translation,
as specified in the project proposal:

Task 1.3 Improved Learning for Machine Translation [M01–M36] (CUNI, DFKI,
RWTH, UVA) Experiments performed within this task will address the second ob-
jective, namely full structured prediction, with discriminative and integrated train-
ing. Focus will be on better correlation of the training criterion with the target
metrics, avoidance of overfitting by incorporating certain so far unused smoothing
techniques into the training process, and efficient algorithm to allow the use of all
training data in all phases of the training process.

The first focus point of Task 1.3 is constructing full structured predictions. We analyzed two different approaches to achieve this. RWTH shows in Section 2 how to model bilingual sentence pairs on the word level, together with reordering information, as one sequence – joint translation and reordering (JTR) sequences. Paired with a phrase-based machine translation system, this gave a significant improvement of up to 2.2 Bleu points over the baseline. RWTH proposes in Section 3 an alignment-based neural machine translation approach as an alternative to the popular attention-based approach. They demonstrate competitive results on the IWSLT 2013 German→English and BOLT Chinese→English tasks.
Beyond systems designed within a unified framework with a single search for the best trans-
lation, we also experiment with a high-level combination of systems, which gives us the best
possible performance. In this joint effort the groups from the QT21 and the HimL project
managed to build the best system for the English→Romanian translation task of the ACL 2016
First Conference on Machine Translation (WMT 2016); see Section 4.
These systems depend on automatic ways to optimize parameters with a good training criterion. The next focus of this task is therefore a better correlation of the training criterion with the target metrics. To this end, we organized and also participated in the Tuning Task at WMT 2016. The organization of the task itself falls under WP4, and the details of the task as a whole will be described in the corresponding deliverable there. In this deliverable, we include our submissions to the Tuning Task. The Beer evaluation metric for MT was improved by the UvA team to be both of higher quality and significantly faster to compute, which makes it competitive with Bleu for tuning; see Section 5. Another submission to the Tuning Task was contributed by CUNI, who replaced the core optimizer of the standard Mert procedure with Particle Swarm Optimization; see Section 6. Additionally, RWTH introduced CharacTER, a character-level TER in which the edit distance is calculated at the character level while the shift edit is performed at the word level. It has not yet been tested for tuning, but it shows very high correlation with human judgment, as described in Section 7.
Even with better metrics in place, we still need to avoid overfitting to rare events while learning from them. We developed multiple ways to achieve this by applying smoothing in different variations. One approach, described by RWTH in Section 8, uses bag-of-words input features for feed-forward neural networks with individually trained decay rates, which gave results comparable to LSTM networks. The second approach from RWTH smooths the standard phrase translation probability by reducing the vocabulary with a word-label mapping. This work showed empirically that the smoothing is not significantly affected by the choice of label vocabulary and that it is more effective for large-scale translation tasks; see Section 9.
Section 10 presents work by DFKI on smoothing using word classes for morphologically-
rich languages. These languages have large vocabularies, which increase training times and
data sparsity. The research developed multiple techniques to improve both the scalability and
quality of word clusters for word alignment.


2 Joint Translation and Reordering Sequences


In joint work with WP2, RWTH introduced a method that converts bilingual sentence pairs and their word alignments into joint translation and reordering (JTR) sequences. This combines interdepending lexical and alignment dependencies into a single framework. A main advantage of JTR sequences is that they can be modeled in much the same way as a language model, so that well-established modeling techniques can be reused. RWTH tested three different methods:

• count-based n-gram models with modified Kneser-Ney smoothing, a well-tested technique for language modeling,

• feed-forward neural networks (FFNNs), which should generalize better to unseen events, and

• recurrent neural networks (RNNs), which generalize well and can in principle take the complete history into account.

Comparisons between the count-based JTR model and the operation sequence model (OSM) of Durrani et al. (2013), both used in phrase-based decoding, showed that the JTR model performed at least as well as the OSM, with a slight advantage for JTR. In comparison to the OSM, the JTR model operates on words, leading to a smaller vocabulary size. Moreover, it utilizes simpler reordering structures without gaps and only requires one log-linear feature to be tuned, whereas the OSM needs five. The strongest combination of count and neural network models yields an improvement over the phrase-based system of up to 2.2 Bleu points on the German→English IWSLT task. This combination also outperforms the OSM by up to 1.2 Bleu points on the BOLT Chinese→English tasks.
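To make the JTR representation concrete, the following minimal Python sketch converts a word-aligned sentence pair into a simplified JTR token sequence. It only covers 1-to-1 and unaligned words and the three reordering classes; multiply-aligned words and the exact placement of unaligned source words follow the full algorithm given in Appendix A. All names are illustrative and not part of any released toolkit.

    def jtr_sequence(src, tgt, alignment):
        """Convert a word-aligned sentence pair into a simplified JTR token sequence.

        src, tgt: lists of words; alignment: dict mapping target position -> source position
        (1-to-1 or unaligned target words only, for illustration)."""
        tokens, covered, j_prev = [], set(), -1
        for i, e in enumerate(tgt):
            if i not in alignment:                 # unaligned target word
                tokens.append(("<eps>", e))
                continue
            j = alignment[i]
            if j != j_prev + 1:                    # non-monotone alignment step
                if j == j_prev - 1:
                    tokens.append("STEP_BACKWARD")
                elif j > j_prev + 1:
                    tokens.append("JUMP_FORWARD")
                else:
                    tokens.append("JUMP_BACKWARD")
            tokens.append((src[j], e))             # bilingual word pair token
            covered.add(j)
            j_prev = j
        # source words never aligned are paired with the empty word
        # (the full algorithm emits them in place rather than at the end)
        for j, f in enumerate(src):
            if j not in covered:
                tokens.append((f, "<eps>"))
        return tokens

    # Example: German -> English with one reordering
    src = "wir kommen später zurück".split()
    tgt = "we come back later".split()
    alignment = {0: 0, 1: 1, 2: 3, 3: 2}           # "back" <- "zurück", "later" <- "später"
    print(jtr_sequence(src, tgt, alignment))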
This work appeared as a long paper at EMNLP 2015 (Guta et al., 2015) and is available in
Appendix A.

3 Alignment-Based Neural Machine Translation


Neural machine translation (NMT) has emerged recently as a successful alternative to traditional
phrase-based systems. In NMT, neural networks are used to generate translation candidates
during decoding. This is in contrast to the integration of neural networks in phrase-based
systems, where the networks are used to score, but not to generate the translation hypotheses.
The neural networks used in NMT so far use an attention mechanism to allow the decoder to
pay more attention to certain parts of the source sentence when generating a target word. The
attention part is implemented as a probability distribution over the source positions.
In contrast to the attention mechanism, RWTH proposes an alignment-based approach using
the well-known hidden Markov Model (HMM). In this approach, translation is a generative
process of two steps: alignment and translation. In contrast to the attention-based case, this
method is more flexible as it allows training the alignment and translation model separately,
while still retaining the possibility of performing joint training using forced decoding. We combine the models using a log-linear framework and tune the model weights with minimum error rate training (Mert). Our models use large target vocabularies of up to 149K
words, made feasible by a class-factored output layer. By limiting decoding to the top scoring
classes, we speed up decoding significantly. We demonstrate that the alignment-based approach
is another viable NMT approach, which is capable of slightly outperforming attention-based
NMT on the BOLT Chinese→English task.
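The class-factored output layer is what makes such large vocabularies tractable: each target word is assigned to a word class, the network predicts a distribution over classes and then over words within a class, and decoding can be restricted to the top-scoring classes. The NumPy sketch below illustrates the idea only; the layer sizes, class assignment and pruning value are illustrative choices, not the exact configuration used by RWTH.

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def class_factored_scores(h, W_class, W_word, word2class, top_k=5):
        """p(word | h) = p(class(word) | h) * p(word | class(word), h),
        evaluated only for words whose class is among the top_k classes."""
        p_class = softmax(h @ W_class)                  # distribution over classes
        best_classes = np.argsort(-p_class)[:top_k]     # prune to the top-scoring classes
        scores = {}
        for c in best_classes:
            members = np.where(word2class == c)[0]      # words belonging to class c
            p_word_in_class = softmax(h @ W_word[:, members])
            for w, p in zip(members, p_word_in_class):
                scores[int(w)] = float(p_class[c] * p)
        return scores

    # Toy example: 12-word vocabulary, 4 classes, hidden size 8
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=8)
    W_class = rng.normal(size=(8, 4))
    W_word = rng.normal(size=(8, 12))
    word2class = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
    print(sorted(class_factored_scores(hidden, W_class, W_word, word2class, top_k=2).items()))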
This work was done in cooperation with WP2 and will appear at WMT 2016 (Alkhouli
et al., 2016) and is available in Appendix B.


4 The QT21/HimL Combined Machine Translation System


In a joint effort of groups from RWTH Aachen University, LMU Munich, Charles University
in Prague, University of Edinburgh, University of Sheffield, Karlsruhe Institute of Technol-
ogy, LIMSI, University of Amsterdam, and Tilde, we provided the strongest system for the
English→Romanian translation task of the ACL 2016 First Conference on Machine Transla-
tion (WMT 2016). The submission is a system combination which combines twelve different
statistical machine translation systems built by the participating groups. The systems are
combined using RWTH’s system combination approach. It combines all translations into one
confusion network and extracts the most likely translation by finding the best path through the
network. The final official evaluation shows an improvement of 1.0 Bleu point by the system
combination compared to the best single system on newstest2016.
An examination of the single systems' translations showed that the combination of all systems also increased the number of output words with correct morphology.
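The final step of such a combination can be pictured as a weighted vote over the slots of the confusion network: for every slot, the combined translation keeps the candidate word with the highest accumulated system weight. The sketch below shows only this best-path selection; aligning the twelve system outputs into a confusion network is the harder part and is not shown, and the slots and weights here are made up for illustration.

    from collections import defaultdict

    def best_path(confusion_net, system_weights):
        """confusion_net: list of slots; each slot is a list of (system_id, word) votes,
        where word may be "" to represent an empty arc (word deletion)."""
        output = []
        for slot in confusion_net:
            votes = defaultdict(float)
            for system_id, word in slot:
                votes[word] += system_weights.get(system_id, 1.0)
            best_word = max(votes, key=votes.get)
            if best_word:                      # skip empty arcs
                output.append(best_word)
        return " ".join(output)

    # Toy example with three systems voting on three slots
    net = [
        [("A", "acesta"), ("B", "acesta"), ("C", "acest")],
        [("A", "este"), ("B", "e"), ("C", "este")],
        [("A", ""), ("B", "doar"), ("C", "")],
    ]
    print(best_path(net, {"A": 1.0, "B": 1.0, "C": 1.0}))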
This work was done in cooperation with WP2 and will appear at WMT 2016 (Peter et al.,
2016a) and is available in Appendix C.

5 Improved BEER
Beer, introduced by UvA, is a trained evaluation metric with a linear model that combines features capturing character n-grams and permutation trees. This year the Beer learning algorithm was improved (a linear SVM instead of logistic regression), and some features that are relatively slow to compute (paraphrasing, syntax and permutation trees) were removed, which resulted in a very large speed-up. This speed-up was essential for fast tuning of MT systems: tuning with Beer is now as fast as tuning with Bleu, and for some languages the new implementation is even more accurate. An additional change in Beer is that the usual training for ranking is replaced by a compromise: the initial model is trained for relative ranking (RR) with a ranking SVM, and the SVM output is then scaled with a trained regression model to approximate absolute judgments (direct assessment, DA).
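The two-stage training can be sketched as follows: a linear SVM is first trained on feature-vector differences of human-ranked translation pairs (the usual ranking-by-pairwise-classification trick), and a small regression model then rescales the SVM margin so that the final score approximates absolute adequacy judgments. The feature extraction itself is omitted, and the data and model choices below are illustrative, not the released Beer configuration.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Stage 1: ranking. Each training item is a pair (better, worse) of feature vectors.
    better = rng.normal(loc=0.3, size=(200, 10))   # features of the preferred translations
    worse = rng.normal(loc=0.0, size=(200, 10))    # features of the dispreferred ones
    X_rank = np.vstack([better - worse, worse - better])
    y_rank = np.array([1] * 200 + [0] * 200)
    ranker = LinearSVC(C=1.0).fit(X_rank, y_rank)

    def margin(features):
        """Unscaled metric score: signed distance to the ranking hyperplane."""
        return ranker.decision_function(features.reshape(1, -1))[0]

    # Stage 2: regression. Map SVM margins to absolute (DA-style) judgments.
    dev_features = rng.normal(size=(100, 10))
    dev_da = dev_features.mean(axis=1) + 0.1 * rng.normal(size=100)   # stand-in DA scores
    dev_margins = ranker.decision_function(dev_features).reshape(-1, 1)
    scaler = LinearRegression().fit(dev_margins, dev_da)

    def beer_like_score(features):
        return float(scaler.predict([[margin(features)]])[0])

    print(beer_like_score(rng.normal(size=10)))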
This work on Beer appeared at WMT 2015 (Stanojević and Sima’an, 2015) and is available
in Appendix D.

6 Particle Swarm Optimization for MERT


In Kocur and Bojar (2016) (full text in Appendix E), CUNI describes a replacement of the
core optimization algorithm in the standard Minimum Error Rate Training (Och, 2003) with
Particle Swarm Optimization (PSO, Eberhart et al., 1995).
The expected benefit of PSO is the highly parallelizable structure of the algorithm. CUNI implemented PSO for the Moses toolkit and took part in the WMT16 Tuning Task to see how the weights selected by PSO perform in an extrinsic evaluation.
PSO indeed runs faster and delivers weights leading to Bleu scores very similar to the Mert ones on the development set. Unfortunately, the manual evaluation of the Tuning Task ranks the system optimized with PSO lower, although still within the same top cluster of participating systems, which cannot be distinguished with statistical significance. (Almost all systems participating in the Tuning Task ended up in this cluster.)
The experiments so far were limited to optimizing towards Bleu, but possibly, other MT
evaluation metrics would benefit from the PSO style of exploration of the parameter space more
than Bleu does.
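For reference, the particle swarm update itself is only a few lines: each particle is a full weight vector, and in every iteration it moves towards a blend of its own best position and the swarm's best position, with all candidate evaluations parallelizable. The sketch below optimizes a stand-in objective; in the actual tuning setting the objective would be the Bleu score of the n-best list rescored with the candidate weights, and all constants here are illustrative.

    import numpy as np

    def pso(objective, dim, n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
        rng = np.random.default_rng(seed)
        x = rng.uniform(-1.0, 1.0, size=(n_particles, dim))     # particle positions (weight vectors)
        v = np.zeros_like(x)                                     # particle velocities
        pbest, pbest_val = x.copy(), np.array([objective(p) for p in x])
        gbest = pbest[pbest_val.argmax()].copy()                 # swarm-best weights
        for _ in range(iters):
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
            x = x + v
            vals = np.array([objective(p) for p in x])           # embarrassingly parallel in practice
            improved = vals > pbest_val
            pbest[improved], pbest_val[improved] = x[improved], vals[improved]
            gbest = pbest[pbest_val.argmax()].copy()
        return gbest, pbest_val.max()

    # Stand-in for "Bleu of the n-best list under these feature weights"
    def pseudo_bleu(weights):
        target = np.array([0.2, -0.5, 0.8, 0.1])
        return -np.sum((weights - target) ** 2)

    print(pso(pseudo_bleu, dim=4))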

7 CharacTER: Translation Edit Rate on Character Level


RWTH introduced CharacTER, a novel character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the reference, normalized by the length of the hypothesis sentence. CharacTER calculates the edit distance at the character level, while the shift edit is still performed at the word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word, and can be shifted, if the edit distance between them is below a threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is then computed at the character level. In addition, the length of the hypothesis, rather than that of the reference, is used for normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve a lower TER.
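Stripped of the word-level shift search, the remaining computation is a character-level Levenshtein distance normalized by the hypothesis length. The sketch below implements only that reduced form (shifts and the relaxed word-matching threshold are omitted), so it illustrates the normalization rather than replacing the released CharacTER tool.

    def char_edit_distance(a, b):
        """Plain Levenshtein distance between two character sequences."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def character_level_ter(hypothesis, reference):
        """Character edits needed to turn the hypothesis into the reference,
        normalized by the hypothesis length (not the reference length)."""
        hyp = " ".join(hypothesis.split())
        ref = " ".join(reference.split())
        return char_edit_distance(hyp, ref) / max(len(hyp), 1)

    print(character_level_ter("the house is small", "the houses are small"))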
The experimental results and evaluations showed that CharacTER achieves high correlation with human judgment at the system level, especially for morphologically rich languages, which benefit from the character-level information. It outperforms other strong metrics for translation directions out of English, while the concept remains simple and straightforward.
This work was done in cooperation with WP2 and will appear at WMT 2016 (Wang et al.,
2016) and is available in Appendix F.

8 Bag-of-Words Input Features for Neural Network


Neural network models have recently achieved consistent improvements in statistical machine translation. However, most networks use only simple one-hot encoded word vectors as their input. RWTH investigated exponentially decaying bag-of-words input features for feed-forward neural network translation models. Their work showed that the performance of bag-of-words inputs can be improved by training decay rates along with the other weight parameters. The decay rates fine-tune the effect of distant words on the current translation. Different kinds of decay rates were investigated; a decay rate dependent on the aligned source word performed best. It provides an average improvement of 0.5 Bleu points on three different translation tasks (IWSLT 2013 German→English, WMT 2015 German→English, and BOLT Chinese→English) on top of a state-of-the-art phrase-based system, a baseline which already includes a neural network translation model. The model was able to slightly outperform a bidirectional LSTM translation model on the given tasks, and the experiments showed that a bag-of-words vector outperformed a larger source-side window for the feed-forward neural network on IWSLT.
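The idea can be written in a few lines: instead of a fixed window around the aligned source position, the network receives a weighted sum of all source word embeddings, where the weight of each word decays exponentially with its distance and the decay rate is itself a trainable parameter (in the best variant, one rate per aligned source word). The NumPy sketch below computes such a feature vector for one position; the sizes and the sigmoid parametrization of the decay rate are illustrative.

    import numpy as np

    def decaying_bow_feature(embeddings, position, decay_rate):
        """Exponentially decaying bag-of-words vector for one source position.

        embeddings: (sentence_length, dim) source word embeddings
        decay_rate: value in (0, 1); in training it is a free parameter
                    (e.g. a sigmoid of an unconstrained weight) updated by backpropagation."""
        distances = np.abs(np.arange(len(embeddings)) - position)
        weights = decay_rate ** distances          # weight 1 at the position, decaying with distance
        return (weights[:, None] * embeddings).sum(axis=0)

    # Toy example: 6 source words, 4-dimensional embeddings
    rng = np.random.default_rng(0)
    source_embeddings = rng.normal(size=(6, 4))
    raw_rate = 0.3                                  # unconstrained parameter
    rate = 1.0 / (1.0 + np.exp(-raw_rate))          # squashed into (0, 1)
    print(decaying_bow_feature(source_embeddings, position=2, decay_rate=rate))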
This work will appear at ACL 2016 (Peter et al., 2016b) and is available in Appendix G.

9 Vocabulary Reduction for Phrase Table Smoothing


To address the sparsity problem of the standard phrase translation model, RWTH reduced the vocabulary size by mapping words into a smaller label space, eventually training denser distributions. They developed a smoothed translation model which maps one word of a phrase at a time to its respective label, yielding gains of up to 0.7 Bleu points over a standard phrase-based SMT baseline. They evaluated the smoothing models using various vocabularies of different sizes and structures, showing that different word-label mappings are almost equally effective for the vocabulary reduction. This allows the use of any type of word-level label, e.g. a randomized vocabulary, for the smoothing, which saves the considerable effort of optimizing the structure and granularity of the label vocabulary. This result emphasizes the fundamental sparsity of the standard phrase translation model. Tests of the vocabulary reduction in translation scenarios of different scales showed that the smoothing works better with more parallel data. This implies that OOV handling is more crucial than smoothing the phrase translation model for low-resource translation tasks.
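One simple way to picture the smoothing is as a back-off from the sparse word-level phrase translation probability to a denser label-level probability, where the label vocabulary comes from any word-to-label mapping (word classes, frequency bins, or even a random mapping). The sketch below interpolates the two relative-frequency estimates; it is an illustrative back-off scheme under that reading, not the exact one-word-at-a-time formulation evaluated in the paper.

    from collections import defaultdict

    def train_relative_frequencies(phrase_pairs):
        """phrase_pairs: iterable of (source_phrase, target_phrase) tuples of words."""
        joint, marginal = defaultdict(float), defaultdict(float)
        for f, e in phrase_pairs:
            joint[(f, e)] += 1.0
            marginal[f] += 1.0
        return lambda e, f: joint[(f, e)] / marginal[f] if marginal[f] else 0.0

    def smoothed_phrase_prob(e, f, p_word, p_label, word2label, alpha=0.3):
        """Interpolate the word-level estimate with the label-level estimate."""
        f_lab = tuple(word2label.get(w, "<unk>") for w in f)
        e_lab = tuple(word2label.get(w, "<unk>") for w in e)
        return (1.0 - alpha) * p_word(e, f) + alpha * p_label(e_lab, f_lab)

    # Toy example: two training phrase pairs and a coarse label mapping
    pairs = [(("das", "Haus"), ("the", "house")), (("das", "Auto"), ("the", "car"))]
    word2label = {"das": "DET", "Haus": "NOUN", "Auto": "NOUN",
                  "the": "DET", "house": "NOUN", "car": "NOUN"}
    p_word = train_relative_frequencies(pairs)
    p_label = train_relative_frequencies(
        [(tuple(word2label[w] for w in f), tuple(word2label[w] for w in e)) for f, e in pairs])
    print(smoothed_phrase_prob(("the", "house"), ("das", "Haus"), p_word, p_label, word2label))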
This work will appear at WMT 2016 (Kim et al., 2016) and is available in Appendix H.


10 Faster and Better Word Classes for Word Alignment


Word clusters are useful for many NLP tasks, including training neural network language models, but current growth in dataset sizes is outpacing the ability of word clusterers to handle them. Little attention has been paid so far to inducing high-quality word clusters at a large scale. The predictive exchange algorithm is quite scalable, but it does not always achieve perplexities as good as those of other, slower clustering algorithms.
DFKI introduced the bidirectional, interpolated, refining, and alternating (BIRA) predictive
exchange algorithm. It improves upon the predictive exchange algorithm’s perplexity by up to
18%, giving it perplexities comparable to the slower two-sided exchange algorithm, and better
perplexities than the slower Brown clustering algorithm. The Bira implementation by DFKI
is fast, clustering a 2.5 billion token English News Crawl corpus in 3 hours. It also reduces
machine translation training time while preserving translation quality. The implementation is
portable and freely available.
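Exchange-style clusterers all share the same inner step: tentatively move a word to another class and keep the move if it improves the clustering objective. The sketch below shows a deliberately naive one-sided (predictive) variant that recomputes the full log-likelihood of the class-based model p(w | v) = p(C(w) | v) · p(w | C(w)) after every tentative move; BIRA's contributions (bidirectional objectives, interpolation, refinement, alternation) and the efficiency tricks that make clustering billions of tokens feasible are not shown, and all names here are illustrative.

    import math
    from collections import defaultdict

    def log_likelihood(bigram_counts, unigram_counts, cluster):
        """Log-likelihood of the corpus under p(w | v) = p(C(w) | v) * p(w | C(w))."""
        hist_class = defaultdict(float)   # N(v, C(w)): history word followed by class of w
        class_total = defaultdict(float)  # N(C): total count of words in class C
        for w, n in unigram_counts.items():
            class_total[cluster[w]] += n
        for (v, w), n in bigram_counts.items():
            hist_class[(v, cluster[w])] += n
        ll = 0.0
        for (v, w), n in bigram_counts.items():
            p = (hist_class[(v, cluster[w])] / unigram_counts[v]) * \
                (unigram_counts[w] / class_total[cluster[w]])
            ll += n * math.log(p)
        return ll

    def predictive_exchange(corpus, n_classes=2, sweeps=3):
        unigrams, bigrams = defaultdict(float), defaultdict(float)
        for v, w in zip(corpus, corpus[1:]):
            unigrams[v] += 1
            bigrams[(v, w)] += 1
        unigrams[corpus[-1]] += 1
        vocab = sorted(unigrams)
        cluster = {w: i % n_classes for i, w in enumerate(vocab)}   # arbitrary initialization
        best = log_likelihood(bigrams, unigrams, cluster)
        for _ in range(sweeps):
            for w in vocab:
                for c in range(n_classes):
                    old = cluster[w]
                    if c == old:
                        continue
                    cluster[w] = c                                  # tentative move
                    ll = log_likelihood(bigrams, unigrams, cluster)
                    if ll > best:
                        best = ll                                   # keep the improvement
                    else:
                        cluster[w] = old                            # revert the move
        return cluster, best

    text = "the cat sat on the mat the dog sat on the rug".split()
    print(predictive_exchange(text, n_classes=2))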
This work appeared at NAACL 2016 (Dehdari et al., 2016) and is available in Appendix I.

References
Tamer Alkhouli, Gabriel Bretschner, Jan-Thorsten Peter, Mohammed Hethnawi, Andreas Guta,
and Hermann Ney. 2016. Alignment-based neural machine translation. In ACL 2016 First
Conference on Machine Translation. Berlin, Germany.

Jon Dehdari, Liling Tan, and Josef van Genabith. 2016. BIRA: Improved predictive exchange
word clustering. In Proceedings of the 2016 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies (NAACL).
Association for Computational Linguistics, San Diego, CA, USA, pages 1169–1174.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can
Markov models over minimal translation units help phrase-based SMT? In Proceedings of
the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers). Sofia, Bulgaria, pages 399–405.

Russ C. Eberhart, James Kennedy, et al. 1995. A new optimizer using particle swarm theory. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science. New York, NY, volume 1, pages 39–43.

Andreas Guta, Tamer Alkhouli, Jan-Thorsten Peter, Joern Wuebker, and Hermann Ney. 2015.
A comparison between count and neural network models based on joint translation and re-
ordering sequences. In Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing. Lisbon, Portugal, pages 1401–1411.

Yunsu Kim, Andreas Guta, Joern Wuebker, and Hermann Ney. 2016. A comparative study on
vocabulary reduction for phrase table smoothing. In ACL 2016 First Conference on Machine
Translation. Berlin, Germany.

Viktor Kocur and Ondřej Bojar. 2016. Particle Swarm Optimization Submission for WMT16
Tuning Task. In Proceedings of the First Conference on Machine Translation, Volume 2:
Shared Task Papers. Association for Computational Linguistics, Berlin, Germany, pages 515–
521.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In
Proc. of the Association for Computational Linguistics. Sapporo, Japan.

Jan-Thorsten Peter, Tamer Alkhouli, Hermann Ney, Matthias Huck, Fabienne Braune, Alexander Fraser, Aleš Tamchyna, Ondřej Bojar, Barry Haddow, Rico Sennrich, Frédéric Blain, Lucia Specia, Jan Niehues, Alex Waibel, Alexandre Allauzen, Lauriane Aufrant, Franck Burlot, Elena Knyazeva, Thomas Lavergne, François Yvon, Stella Frank, and Marcis Pinnis. 2016a. The QT21/HimL combined machine translation system. In ACL 2016 First Conference on Machine Translation. Berlin, Germany.

Jan-Thorsten Peter, Weiyue Wang, and Hermann Ney. 2016b. Exponentially decaying bag-of-words input features for feed-forward neural network in statistical machine translation. In Annual Meeting of the Association for Computational Linguistics. Berlin, Germany.

Miloš Stanojević and Khalil Sima’an. 2015. Evaluating MT systems with BEER. The Prague
Bulletin of Mathematical Linguistics 104:17–26.

Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTER:
Translation edit rate on character level. In ACL 2016 First Conference on Machine Transla-
tion. Berlin, Germany.


A Joint Translation and Reordering Sequences

A Comparison between Count and Neural Network Models Based on


Joint Translation and Reordering Sequences
Andreas Guta, Tamer Alkhouli, Jan-Thorsten Peter, Joern Wuebker, Hermann Ney
Human Language Technology and Pattern Recognition Group
RWTH Aachen University
Aachen, Germany
{surname}@cs.rwth-aachen.de

Abstract This work proposes word-based translation


models that are potentially capable of capturing
We propose a conversion of bilingual long-range dependencies. We do this in two steps:
sentence pairs and the corresponding First, given bilingual sentence pairs and the asso-
word alignments into novel linear se- ciated word alignments, we convert the informa-
quences. These are joint translation tion into uniquely defined linear sequences. These
and reordering (JTR) uniquely defined sequenecs encode both word reordering and trans-
sequences, combining interdepending lation information. Thus, they are referred to as
lexical and alignment dependencies on joint translation and reordering (JTR) sequences.
the word level into a single framework. Second, we train an n-gram model with modi-
They are constructed in a simple manner fied Kneser-Ney smoothing (Chen and Goodman,
while capturing multiple alignments 1998) on the resulting JTR sequences. This yields
and empty words. JTR sequences can a model that fuses interdepending reordering and
be used to train a variety of models. translation dependencies into a single framework.
We investigate the performances of n- Although JTR n-gram models are closely re-
gram models with modified Kneser-Ney lated to the operation sequence model (OSM)
smoothing, feed-forward and recur- (Durrani et al., 2013b), there are three main dif-
rent neural network architectures when ferences. To begin with, the OSM employs min-
estimated on JTR sequences, and com- imal translation units (MTUs), which are essen-
pare them to the operation sequence tially atomic phrases. As the MTUs are extracted
model (Durrani et al., 2013b). Evalua- sentence-wise, a word can potentially appear in
tions on the IWSLT German→English, multiple MTUs. In order to avoid overlapping
WMT German→English and BOLT translation units, we define the JTR sequences
Chinese→English tasks show that JTR on the level of words. Consequently, JTR se-
models improve state-of-the-art phrase- quences have smaller vocabulary sizes than OSM
based systems by up to 2.2 BLEU. sequences and lead to models with less sparsity.
Moreover, we argue that JTR sequences offer a
1 Introduction simpler reordering approach than operation se-
quences, as they handle reorderings without the
Standard phrase-based machine translation (Och
need to predict gaps. Finally, when used as an
et al., 1999; Zens et al., 2002; Koehn et al., 2003)
additional model in the log-linear framework of
uses relative frequencies of phrase pairs to esti-
phrase-based decoding, an n-gram model trained
mate a translation model. The phrase table is ex-
on JTR sequences introduces only one single fea-
tracted from a bilingual text aligned on the word
ture to be tuned, whereas the OSM additionally
level, using e.g. GIZA++ (Och and Ney, 2003). Al-
uses 4 supportive features (Durrani et al., 2013b).
though the phrase pairs capture internal dependen-
Experimental results confirm that this simplifica-
cies between the source and target phrases aligned
tion does not make JTR models less expressive, as
to each other, they fail to model dependencies that
their performance is on par with the OSM.
extend beyond phrase boundaries. Phrase-based
decoding involves concatenating target phrases. Due to data sparsity, increasing the n-gram or-
The burden of ensuring that the result is linguisti- der of count-based models beyond a certain point
cally consistent falls on the language model (LM). becomes useless. To address this, we resort to neu-

Page 10 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

ral networks (NNs), as they have been successfully words and do not include reordering information.
applied to machine translation recently (Sunder- Durrani et al. (2011) developed the OSM which
meyer et al., 2014; Devlin et al., 2014). They are combined dependencies on bilingual word pairs
able to score any word combination without re- and reordering information into a single frame-
quiring additional smoothing techniques. We ex- work. It used an own decoder that was based on n-
periment with feed-forward and recurrent trans- grams of MTUs and predicted single translation or
lation networks, benefiting from their smoothing reordering operations. This was further advanced
capabilities. To this end, we split the linear se- in (Durrani et al., 2013a) by a decoder that was
quence into two sequences for the neural transla- capable of predicting whole sequences of MTUs,
tion models to operate on. This is possible due to similar to a phrase-based decoder. In (Durrani et
the simplicity of the JTR sequence. We show that al., 2013b), a slightly enhanced version of OSM
the count and NN models perform well on their was integrated into the log-linear framework of
own, and that combining them yields even better the Moses system (Koehn et al., 2007). Both the
results. BILM (Stewart et al., 2014) and the OSM (Durrani
In this work, we apply n-gram models with et al., 2014) can be smoothed using word classes.
modified Kneser-Ney smoothing during phrase- Guta et al. (2015) introduced the extended trans-
based decoding and neural JTR models in rescor- lation model (ETM), which operates on the word
ing. However, using a phrase-based system is not level and augments the IBM models by an addi-
required by the model, but only the initial step to tional bilingual word pair and a reordering opera-
demonstrate the strength of JTR models, which tion. It is implemented into the log-linear frame-
can be applied independently of the underlying de- work of a phrase-based decoder and shown to be
coding framework. While the focus of this work is competitive with a 7-gram OSM.
on the development and comparison of the models, The JTR n-gram models proposed within this
the long-term goal is to decode using JTR mod- work can be seen as an extension of the ETM.
els without the limitations introduced by phrases, Nevertheless, JTR models utilize linear sequences
in order to exploit the full potential of JTR mod- of dependencies and combine the translation of
els. The JTR models are estimated on word align- bilingual word pairs and reoderings into a sin-
ments, which we obtain using GIZA++ in this pa- gle model. The ETM, however, features separate
per. The future aim is to also generate improved models for the translation of individual words and
word alignments by a joint optimization of both reorderings and provides an explicit treatment of
the alignments and the models, similar to the train- multiple alignments. As they operate on linear se-
ing of IBM models (Brown et al., 1990; Brown et quences, JTR count models can be implemented
al., 1993). In the long run, we intend to achieve a using existing toolkits for n-gram language mod-
consistency between decoding and training using els, e.g. the KenLM toolkit (Heafield et al., 2013).
the introduced JTR models. An HMM approach for word-to-phrase align-
ments was presented in (Deng and Byrne, 2005),
2 Previous Work
showing performance similar to IBM Model 4 on
In order to address the downsides of the phrase the task of bitext alignment. Feng et al. (2013)
translation model, various approaches have been propose several models which rely only on the in-
taken. Mariño et al. (2006) proposed a bilingual formation provided by the source side and pre-
language model (BILM) that operates on bilin- dict reorderings. Contrastingly, JTR models in-
gual n-grams, with an own n-gram decoder re- corporate target information as well and predict
quiring monotone alignments. The lexical re- both translations and reorderings jointly in a sin-
ordering model introduced in (Tillmann, 2004) gle framework.
was integrated into phrase-based decoding. Crego Zhang et al. (2013) explore different Markov
and Yvon (2010) adapted the approach to BILMs. chain orderings for an n-gram model on MTUs
The bilingual n-grams are further advanced in in rescoring. Feng and Cohn (2013) present an-
(Niehues et al., 2011), where they operate on non- other generative word-based Markov chain trans-
monotone alignments within a phrase-based trans- lation model which exploits a hierarchical Pitman-
lation framework. Compared to our JTR models, Yor process for smoothing, but it is only applied
their BILMs treat jointly aligned source words as to induce word alignments. Their follow-up work
minimal translation units, ignore unaligned source (Feng et al., 2014) introduces a Markov-model on

Page 11 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

MTUs, similar to the OSM described above. Algorithm 1 JTR Conversion Algorithm
Recently, neural machine translation has 1: procedure JTR CONVERSION( f1J , eI1 , bI1 )
emerged as an alternative to phrase-based decod- 2: gK1←0 /
ing, where NNs are used as standalone models to 3: // last translated source position j0
4: 0
j ←0
decode source input. In (Sutskever et al., 2014), 5: for i ← 1 to I do
a recurrent NN was used to encode a source 6: if ei is unaligned then
sequence, and output a target sentence once the 7: // align ei to the empty word ε
8: APPEND (gK 1 , hε, ei i)
source sentence was fully encoded in the network. 9: continue
The network did not have any explicit treatment 10: // ei is aligned to at least one source word
of alignments. Bahdanau et al. (2015) introduced 11: j ← first source position in bi
12: if j = j0 then
soft alignments as part of the network architecture. 13: // ei is aligned to the same f j as ei−1
In this work, we make use of hard alignments 14: APPEND (gK 1 , hσ , ei i)
instead, where we encode the alignments in the 15: continue
16: 0
if j 6= j + 1 then
source and target sequences, requiring no mod- 17: // alignment step is non-monotone
ifications of existing feed-forward and recurrent 18: REORDERINGS ( f 1J , bI1 , gK 0
1 , j , j)
NN architectures. Our feed-forward models are 19: // 1-to-1 translation: f j is aligned to ei
based on the architectures proposed in (Devlin et 20: APPEND (gK 1 , h f j , ei i)
al., 2014), while the recurrent models are based 21: j0 ← j
22: // generate all other f j that are also
on (Sundermeyer et al., 2014). Further recent 23: // aligned to the current target word ei
research on applying NN models for extended 24: for all remaining j in bi do
25: APPEND (gK 1 , h f j , σ i)
context was carried out in (Le et al., 2012; Auli
26: j0 ← j
et al., 2013; Hu et al., 2014). All of these works 27: // check last alignment step at sentence end
focus on lexical context and ignore the reordering 28: if j0 6= J then
aspect covered in our work. 29: // last alignment step is non-monotone
REORDERINGS ( f 1J , bI1 , gK 0
30: 1 , j , J + 1)
3 JTR Sequences 31: return gK 1
32:
The core idea of this work is the interpretation of 33: // called when a reordering class is appended
34: procedure REORDERINGS( f1J , bI1 , gK 0
a bilingual sentence pair and its word alignment 1 , j , j)
35: // check if the predecessor is unaligned
as a linear sequence of K joint translation and re- 36: if f j−1 is unaligned then
ordering (JTR) tokens gK1 . Formally, the sequence 37: // get unaligned predecessors
j−1
gK1 ( f1J , eI1 , bI1 ) is a uniquely defined interpretation 38: f j0 ← unaligned predecessors of f j
39: // check if the alignment step to the first
of a given source sentence f1J , its translation eI1 and 40: // unaligned predecessor is monotone
the inverted alignment bI1 , where bi denotes the 41: if j0 6= j0 + 1 then
ordered sequence of source positions j aligned to 42: // non-monotone: add reordering class
43: APPEND (gK 1 , ∆ j0 , j0 )
target position i. We drop the explicit mention of
44: // translate unaligned predecessors by ε
( f1J , eI1 , bI1 ) to allow for a better readability. Each 45: for f ← f j0 to f j−1 do
JTR token is either an aligned bilingual word pair 46: APPEND (gK 1 , h f , εi)
h f , ei or a reordering class ∆ j0 j . 47: else
Unaligned words on the source and target side 48: // non-monotone: add reordering class
49: APPEND (gK 1 , ∆ j0 , j )
are processed as if they were aligned to the empty
word ε. Hence, an unaligned source word f gener-
ates the token h f , εi, and an unaligned target word
e the token hε, ei. words. Similar to Feng and Cohn (2013), we clas-
Each word of the source and target sentences is sify the reordered source positions j0 and j by ∆ j0 j :
to appear in the corresponding JTR sequence ex-
actly once. For multiply-aligned target words e, 
the first source word f that is aligned to e gener- step backward (←),
 j = j0 − 1
ates the token h f , ei. All other source words f 0 , ∆ j0 j = jump forward (y), j > j0 + 1
that are also aligned to e, are processed as if they

jump backward (x), j < j0 − 1.

were aligned to the artificial word σ . Thus, each
of these f 0 generates a token h f 0 , σ i. The same
approach is applied to multiply-aligned source The reordering classes are illustrated in Figure 1.

Page 12 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

i
i i−1
i i−1
i−1

j j0 j0 j j j0
(a) step backward (←) (b) jump forward (y) (c) jump backward (x)

Figure 1: Overview of the different reordering classes in JTR sequences.

3.1 Sequence Conversion .


code
Algorithm 1 presents the formal conversion of a
your
bilingual sentence pair and its alignment into the
corresponding JTR sequence gK1 . At first, gK1 is enter
initialized by an empty sequence (line 2). For each ,
target position i = 1, . . . , I it is extended by at least field
one token. During the generation process, we store Command
the last visited source position j0 (line 4). If a tar- the
get word ei is in geben
Sie
im
Feld
Befehl
Ihren
Code
ein
.
• unaligned, we align it to the empty word ε
and append hε, ei i to the current gK1 (line 8),
• if it is aligned to the same f j as ei−1 , we only
add hσ , ei i (line 14), Figure 2: This example illustrates the JTR se-
• otherwise we append h f j , ei i (line 20) and quence gK1 for a German→English sentence pair
• in case there are more source words aligned including the word-to-word alignment.
to ei , we additionally append h f j , σ i for each
of these (line 24). token has to be generated right before h., .i is
Before a token h f j , ei i is generated, we have to generated. Therefore, there is no forward jump
check whether the alignment step from j0 to j is from hCode, codei to h., .i, but a monotone step
monotone (line 16). In case it is not, we have to to hein, εi followed by h., .i.
deal with reorderings (line 34). We define that 3.2 Training of Count Models
a token h f j−1 , εi is to be generated right before
the generation of the token containing f j . Thus, As the JTR sequence gK1 is a unique interpretation
if f j−1 is not aligned, we first determine the con- of a bilingual sentence pair and its alignment, the
probability p( f1J , eI1 , bI1 ) can be computed as:
tiguous sequence of unaligned predecessors f jj−1 0
(line 38). Next, if the step from j0 to j0 is not p( f1J , eI1 , bI1 ) = p(gK1 ). (1)
monotone, we add the corresponding reordering
The probability of gK1
can be factorized and ap-
class (line 43). Afterwards we append all h f j0 , εi
proximated by an n-gram model.
to h f j−1 , εi. If f j−1 is aligned, we do not have to K
process unaligned source words and only append p(gK1 ) = ∏ p(gk |gk−n+1
k−1
) (2)
the corresponding reordering class (line 49). k=1
Figure 2 illustrates the generation steps of a Within this work, we first estimate the Viterbi
JTR sequence, whose result is presented in Ta- alignment for the bilingual training data using
ble 1. The alignment steps are denoted by the ar- GIZA++ (Och and Ney, 2003). Secondly, the con-
rows connecting the alignment points. The first version presented in Algorithm 1 is applied to ob-
dashed alignment point indicates the hε, ,i token tain the JTR sequences, on which we estimate an
that is generated right after the hFeld, fieldi to- n-gram model with modified Kneser-Ney smooth-
ken. The second dashed alignment point indicates ing as described in (Chen and Goodman, 1998) us-
the hein, εi token, which corresponds to the un- ing the KenLM toolkit1 (Heafield et al., 2013).
aligned source word ein. Note, that the hein, εi 1 https://kheafield.com/code/kenlm/

Page 13 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

k gk sk tk De→En, this led to a baseline weaker by 0.2 B LEU


1 y δ y than the one described in Section 5. In order to
2 him, ini im in have an unconstrained and fair baseline, we there-
3 hσ , thei σ the
4 y δ y
after removed this constraint and forced such dele-
5 hBefehl, Commandi Befehl Command tion tokens to be generated at the end of the se-
6 ← δ ← quence. Hence, we accept that the JTR model
7 hFeld, fieldi Feld field
8 hε, ,i ε , might compute the wrong score in these special
9 x δ x cases.
10 hgeben, enteri geben enter
11 hSie, σ i Sie σ 4 Neural Networks
12 y δ y
13 hIhren, youri Ihren your
14 hCode, codei Code code Usually, smoothing techniques are applied to
15 hein, εi ein ε count-based models to handle unseen events. A
16 h., .i . . neural network does not suffer from this, as it
is able to score unseen events without additional
Table 1: The left side of this table presents the JTR smoothing techniques. In the following, we will
tokens gk corresponding to Figure 2. The right describe how to adapt JTR sequences to be used
side shows the source and target tokens sk and tk with feed-forward and recurrent NNs.
obtained from the JTR tokens gk . They are used The first thing to notice is the vocabulary size,
for the training of NNs (cf. Section 4). mainly determined by the number of bilingual
word pairs, which constituted atomic units in the
count-based models. NNs that compute probabil-
3.3 Integration into Phrase-based Decoding ity values at the output layer evaluate a softmax
Basically, each phrase table entry is annotated function that produces normalized scores that sum
with both the word alignment information, which up to unity. The softmax function is given by:
also allows to identify unaligned source words,
i−1
and the corresponding JTR sequence. The JTR eoei (e1 )
p(ei |ei−1
1 )= |V |
(3)
model is added to the log-linear framework as an i−1
∑w=1 eow (e1 )
additional n-gram model. Within the phrase-based
decoder, we extend each search state such that it where oei and ow are the raw unnormalized output
additionally stores the JTR model history. layer values for the words ei and w, respectively,
In comparison to the OSM, the JTR model does and |V | is the vocabulary size. The output layer
not predict gaps. Local reorderings within phrases is a function of the context ei−1
1 . Computing the
are handled implicitly. On the other hand, we rep- denominator is expensive for large vocabularies,
resent long-range reorderings between phrases by as it requires computing the output for all words.
the coverage vector and limit them by reordering Therefore, we split JTR tokens gk and use indi-
constraints. vidual words as input and output units, such that
Phrase-pairs ending with unaligned source the NN receives jumps, source and target words as
words at their right boundary prove to be a prob- input and outputs target words and jumps. Hence,
lem during decoding. As shown in Subsection 3.1, the resulting neural model is not a LM, but a trans-
the conversion from word alignments to JTR se- lation model with different input and output vo-
quences assumes that each token corresponding to cabularies. A JTR sequence gK1 is split into its
an unaligned source word is generated immedi- source and target parts sK1 and t1K . The construc-
ately before the token corresponding to the closest tion of the JTR source sequence sK1 proceeds as
aligned source position to its right. However, if a follows: Whenever a bilingual pair is encountered,
phrase ends with an unaligned f j as its rightmost the source word is kept and the target word is dis-
source word, the generation of the h f j , εi token has carded. In addition, all jump classes are replaced
to be postponed until the next word f j+1 is to be by a special token δ . The JTR target sequence t1K is
translated or, even worse, f j+1 has already been constructed similarly by keeping the target words
translated before. and dropping source words, and the jump classes
To address this issue, we constrained the phrase are also kept. Table 1 shows the JTR source and
table extraction to discard entries with unaligned target sequences corresponding to JTR sequence
source tokens at the right boundary. For IWSLT of Figure 2.

Page 14 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

Due to the design of the JTR sequence, pro- layer with the target input tk−1 as well, that is,
ducing the source and target JTR sequences is we aggregate the embeddings of the input source
straightforward. The resulting sequences can then word sk and the input target word tk−1 before they
be used with existing NN architectures, without are fed into the forward layer. Due to recurrency,
further modifications to the design of the net- the forward layer encodes the parts (t1k−1 , sk1 ), and
works. This results in powerful models that re- the backward layer encodes sKk , and together they
quire little effort to implement. encode (t1k−1 , sK1 ), which is used to score the out-
put target word tk . For the sake of comparison
4.1 Feed-forward Neural JTR to FFNN and count models, we also experiment
First, we will apply a feed-forward NN (FFNN) to with a recurrent model that does not include future
the JTR sequence. FFNN models resemble count- source information, this is obtained by replacing
based models in using a predefined limited context the term sK1 with sk1 in Eq. 5. It will be referred
size, but they do not encounter the same smooth- to as the unidirectional recurrent neural network
ing problems. In this work, we use a FFNN similar (URNN) model in the experiments.
to that proposed in (Devlin et al., 2014), defined Note that the JTR source and target sides
as: include jump information, therefore, the RNN
K model described above explicitly models reorder-
p(t1K |sK1 ) ≈ ∏ p(tk |tk−n
k−1 k
, sk−n ). (4) ing. In contrast, the models proposed in (Sunder-
k=1
meyer et al., 2014) do not include any jumps, and
It scores the JTR target word tk at position k us- hence do not provide an explicit way of includ-
ing the current source word sk , and the history of ing word reordering. In addition, the JTR RNN
n JTR source words. In addition, the n JTR target models do not require the use of IBM-1 lexica to
words preceding tk are used as context. The FFNN resolve multiply-aligned words. As discussed in
computes the score by looking up the vector em- Section 3, these cases are resolved by aligning the
beddings of the source and target context words, multiply-aligned word to the first word on the op-
concatenating them, then evaluating the rest of the posite side.
network. We reduce the output layer to a short- The integration of the NNs into the decoder is
list of the most frequent words, and compute word not trivial, due to the dependence on the target
class probabilities for the remaining words. context. In the case of RNNs, the context is un-
bounded, which would affect state recombination,
4.2 Recurrent Neural JTR and lead to less variety in the beam used to prune
Unlike feed-forward NNs, recurrent NNs (RNNs) the search space. Therefore, the RNN scores are
enable the use of unbounded context. Following computed using approximations instead (Auli et
(Sundermeyer et al., 2014), we use bidirectional al., 2013; Alkhouli et al., 2015). In (Alkhouli et
recurrent NNs (BRNNs) to capture the full JTR al., 2015), it is shown that approximate RNN inte-
source side. The BRNN uses the JTR target side gration into the phrase-based decoder has a slight
as well as the full JTR source side as context, and advantage over n-best rescoring. Therefore, we
it is given by: apply RNNs in rescoring in this work, and to al-
K low for a direct comparison between FFNNs and
p(t1K |sK1 ) = ∏ p(tk |t1k−1 , sK1 ) (5) RNNs, we apply FFNNs in rescoring as well.
k=1

This equation is realized by a network that uses 5 Evaluation


forward and backward recurrent layers to capture
We perform experiments on the large-
the complete source sentence. By a forward layer
scale IWSLT 20132 (Cettolo et al.,
we imply a recurrent hidden layer that processes
2014) German→English, WMT 20153
a given sequence from left to right, while a back-
German→English and the DARPA BOLT
ward layer does the processing backwards, from
Chinese→English tasks. The statistics for the
right to left. The source sentence is basically split
bilingual corpora are shown in Table 2. Word
at a given position k, then past and future represen-
alignments are generated with the GIZA++ toolkit
tations of the sentence are recursively computed
by the forward and backward layers, respectively. 2 http://www.iwslt2013.org

To include the target side, we provide the forward 3 http://www.statmt.org/wmt15/

Page 15 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

IWSLT WMT BOLT


German English German English Chinese English
Sentences 4.32M 4.22M 4.08M
Run. Words 108M 109M 106M 108M 78M 86M
Vocabulary 836K 792K 814K 773K 384K 817K

Table 2: Statistics for the bilingual training data of the IWSLT 2013 German→English, WMT 2015
German→English, and the DARPA BOLT Chinese→English translation tasks.

(Och and Ney, 2003). We use a standard phrase- 5.1 Tasks description
based translation system (Koehn et al., 2003). The domain of IWSLT consists of lecture-type
The decoding process is implemented as a beam talks presented at TED conferences which are also
search. All baselines contain phrasal and lexical available online4 . All systems are optimized on
smoothing models for both directions, word and the dev2010 corpus, named dev here. Some
phrase penalties, a distance-based reordering of the OSM and JTR systems are trained on the
model, enhanced low frequency features (Chen TED portions of the data containing 138K sen-
et al., 2011), a hierarchical reordering model tences. To estimate the 4-gram LM, we addi-
(HRM) (Galley and Manning, 2008), a word tionally make use of parts of the Shuffled News,
class LM (Wuebker et al., 2013) and an n-gram LDC English Gigaword and 109 -French-English
LM. The lexical and phrase translation models of corpora, selected by a cross-entropy difference cri-
all baseline systems are trained on all provided terion (Moore and Lewis, 2010). In total, 1.7 bil-
bilingual data. The log-linear feature weights are lion running words are taken for LM training. The
tuned with minimum error rate training (MERT) BOLT Chinese→English task is evaluated on the
(Och, 2003) on B LEU (Papineni et al., 2001). All “discussion forum” domain. The 5-gram LM is
systems are evaluated with MultEval (Clark et al., trained on 2.9 billion running words in total. The
2011). The reported B LEU scores are averaged in-domain data consists of a subset of 67.8K sen-
over three MERT optimization runs. tences and we used a set of 1845 sentences for tun-
ing. The evaluation set test1 contains 1844 and
test2 1124 sentences. For the WMT task, we
All LMs, OSMs and count-based JTR models
used the target side of the bilingual data and all
are estimated with the KenLM toolkit (Heafield et
monolingual data to train a pruned 5-gram LM on
al., 2013). The OSM and the count-based JTR
a total of 4.4 billion running words. We concate-
model are implemented in the phrasal decoder.
nated the newstest2011 and newstest2012
NNs are used only in rescoring. The 9-gram
corpora for tuning the systems.
FFNNs are trained with two hidden layers. The
short lists contain the 10k most frequent words, 5.2 Results
and all remaining words are clusterd into 1000
word classes. The projecton layer has 17 × 100 We start with the IWSLT 2013 German→ English
nodes, the first hidden layer 1000 and the sec- task, where we compare between the different JTR
ond 500. The RNNs have LSTM architectures. and OSM models. The results are shown in Ta-
The URNN has 2 hidden layers while the BRNN ble 3. When comparing the in-domain n-gram
has one forward, one backward and one addi- JTR model trained using Kneser-Ney smoothing
tional hidden layer. All layers have 200 nodes, (KN) to OSM, we observe that the n-gram KN
while the output layer is class-factored using 2000 JTR model improves the baseline by 1.4 B LEU
classes. For the count-based JTR model and OSM on both test and eval11. The OSM model
we tuned the n-gram size on the tuning set of each performs similarly, with a slight disadvantage on
task. For the full data, 7-grams were used for the eval11. In comparison, the FFNN of Eq. (4) im-
IWSLT and WMT tasks, and 8-grams for BOLT. proves the baseline by 0.7–0.9 B LEU, compared to
When using in-domain data, smaller n-gram sizes the slightly better 0.8–1.1 B LEU achieved by the
were used. All rescoring experiments used 1000- URNN. The difference between the FFNN and the
best lists without duplicates. 4 http://www.ted.com/

Page 16 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

data dev test eval11 part of the source input when scoring target words.
This information is not used by the KN model.
baseline full 33.3 30.8 35.7 Moreover, the BRNN is able to score word com-
+OSM TED 34.5 32.2 36.8 binations unseen in training, while the KN model
+FFNN TED 34.0 31.7 36.4 uses backing off to score unseen events.
+URNN TED 34.2 31.9 36.5 When training the KN, FFNN, and OSM mod-
+BRNN TED 34.4 32.1 36.8 els on the full data, we observe less gains in com-
+KN TED 34.6 32.2 37.1 parison to in-domain data training. However, com-
+BRNN TED 35.0 32.8 37.7 bining the KN models trained on in-domain and
+OSM full 34.1 31.6 36.5 full data gives additional gains, which suggests
+FFNN full 33.9 31.5 36.0 that although the in-domain model is more adapted
+KN full 34.2 31.6 36.6 to the task, it still can gain from out-of-domain
data. Adding the FFNN on top improves the com-
+KN TED 34.9 32.4 37.1 bination. Note here that the FFNN sees the same
+FFNN TED 35.2 32.7 37.2 information as the KN model, but the difference is
+FFNN full 35.1 32.7 37.2 that the NN operates on the word level rather than
+BRNN TED 35.5 33.0 37.4 the word-pair level. Second, the FFNN is able to
+BRNN TED 35.4 33.0 37.3 handle unseen sequences by design, without the
need for the backing off workaround. The BRNN
Table 3: Results measured in B LEU for the IWSLT improves the combination more than the FFNN,
German→English task. as the model captures an unbounded source and
target history in addition to an unbounded future
source context. Combining the KN, FFNN and
train data test1 test2
BRNN JTR models leads to an overall gain of 2.2
baseline 18.1 17.0 B LEU on both dev and test.
+OSM indomain 18.8 17.2 Next, we present the BOLT Chinese→English
+FFNN indomain 18.6 17.6 results, shown in Table 4. Comparing n-gram
+BRNN indomain 18.6 17.6 KN JTR and OSM trained on the in-domain data
+KN indomain 18.8 17.5 shows they perform equally well on test1, im-
proving the baseline by 0.7 B LEU, with a slight ad-
+OSM full 18.5 17.2
vantage for the JTR model on test2. The feed-
+FFNN full 18.4 17.4
forward and the recurrent in-domain networks
+KN full 18.8 17.3
yield the same results in comparison to each other.
+KN indomain 19.0 17.7 Training the OSM and JTR models on the full data
+FFNN full 19.2 18.3 yields slightly worse results than in-domain train-
+RNN indomain 19.3 18.4 ing. However, combining the two types of training
improves the results. This is shown when adding
Table 4: Results measured in B LEU for the BOLT the in-domain KN JTR model on top of the model
Chinese→English task. trained on full data, improving it by up to 0.4
B LEU. Rescoring with the feed-forward and the
recurrent network improves this even further, sup-
URNN is that the latter captures the unbounded porting the previous observation that the n-gram
source and target history that extends until the be- KN JTR and NNs complement each other. The
ginning of the sentences, giving it an advantage combination of the 4 models yields an overall im-
over the FFNN. The performance of the URNN provement of 1.2–1.4 B LEU.
can be improved by including the future part of the Finally, we compare KN JTR and OSM models
source sentence, as described in Eq. (5), resulting on the WMT German→English task in Table 5.
in the BRNN model. Next, we explore whether the The two models perform almost similar to each
models are additive. When rescoring the n-gram other. The JTR model improves the baseline by
KN JTR output with the BRNN, an additional im- up to 0.7 B LEU. Rescoring the KN JTR with the
provement of 0.6 B LEU is obtained. There are two FFNN improves it by up to 0.3 B LEU leading to an
reasons for this: The BRNN includes the future overall improvement between 0.5 and 1.0 B LEU.


          newstest
         2013   2014   2015
baseline  28.1   28.6   29.4
+OSM      28.6   28.9   30.0
+FFNN     28.7   28.9   29.7
+KN       28.8   28.9   29.9
 +FFNN    29.1   29.1   30.0

Table 5: Results measured in BLEU for the WMT German→English task.

5.3 Analysis

To investigate the effect of including jump information in the JTR sequence, we trained a BRNN using jump classes and another excluding them. The BRNNs were used in rescoring. Below, we demonstrate the difference between the systems:

source:     wir kommen später noch auf diese Leute zurück .
reference:  We’ll come back to these people later .

Hypothesis 1:
JTR source: wir kommen δ zurück δ später noch auf diese Leute δ .
JTR target: we come y back x later σ to these people y .

Hypothesis 2:
JTR source: wir kommen später noch auf diese Leute zurück .
JTR target: we come later σ on these guys back .

Note the German verb “zurückkommen”, which is split into “kommen” and “zurück”. German places “kommen” at the second position and “zurück” towards the end of the sentence. Unlike German, the corresponding English phrase “come back” has the words adjacent to each other. We found that the system including jumps prefers the correct translation of the verb, as shown in Hypothesis 1 above. The system translates “kommen” to “come”, jumps forward to “zurück”, translates it to “back”, then jumps back to continue translating the word “später”. In contrast, the system that excludes jump classes is blind to this separation of words. It favors Hypothesis 2, which is a strictly monotone translation of the German sentence. This is also reflected by the BLEU scores, where we found the system including jump classes outperforming the one without by up to 0.8 BLEU.

6 Conclusion

We introduced a method that converts bilingual sentence pairs and their word alignments into joint translation and reordering (JTR) sequences. They combine interdepending lexical and alignment dependencies into a single framework. A main advantage of JTR sequences is that a variety of models can be trained on them. Here, we have estimated n-gram models with modified Kneser-Ney smoothing, FFNN and RNN architectures on JTR sequences.

We compared our count-based JTR model to the OSM, both used in phrase-based decoding, and showed that the JTR model performs at least as well as the OSM, with a slight advantage for JTR. In comparison to the OSM, the JTR model operates on words, leading to a smaller vocabulary size. Moreover, it utilizes simpler reordering structures without gaps and only requires one log-linear feature to be tuned, whereas the OSM needs 5. Due to the flexibility of JTR sequences, we can also apply them to FFNNs and RNNs. Utilizing two count models and applying both networks in rescoring gains the overall highest improvement over the phrase-based system, by up to 2.2 BLEU on the German→English IWSLT task. The combination outperforms the OSM by up to 1.2 BLEU on the BOLT Chinese→English tasks.

The JTR models are not dependent on the phrase-based framework, and one of the long-term goals is to perform standalone decoding with the JTR models independently of phrase-based systems. Without the limitations introduced by phrases, we believe that JTR models could perform even better. In addition, we aim to use JTR models to obtain the alignment, which would then be used to train the JTR models in an iterative manner, achieving consistency and hoping for improved models.

Acknowledgements

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21). This material is partially based upon work supported by the DARPA BOLT project under Contract No. HR0011-12-C-0015. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.


References

Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. Investigations on phrase-based decoding with recurrent neural network language and translation models. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, September. To appear.

Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint Language and Translation Modeling with Recurrent Neural Networks. In Conference on Empirical Methods in Natural Language Processing, pages 1044–1054, Seattle, USA, October.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, San Diego, California, USA, May.

Peter F. Brown, John Cocke, Stephan A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Rossin. 1990. A Statistical Approach to Machine Translation. Computational Linguistics, 16(2):79–85, June.

Peter F. Brown, Stephan A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311, June.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT Evaluation Campaign, IWSLT 2014. In International Workshop on Spoken Language Translation, pages 2–11, Lake Tahoe, CA, USA, December.

Stanley F. Chen and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, Cambridge, MA, August.

Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. 2011. Unpacking and transforming feature functions: New ways to smooth phrase tables. In MT Summit XIII, pages 269–275, Xiamen, China, September.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In 49th Annual Meeting of the Association for Computational Linguistics: Short Papers, pages 176–181, Portland, Oregon, June.

Josep Maria Crego and François Yvon. 2010. Improving reordering with linguistically informed bilingual n-grams. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010: Posters), pages 197–205, Beijing, China.

Yonggang Deng and William Byrne. 2005. HMM word and phrase alignment for statistical machine translation. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 169–176, Vancouver, British Columbia, Canada, October.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and Robust Neural Network Joint Models for Statistical Machine Translation. In 52nd Annual Meeting of the Association for Computational Linguistics, pages 1370–1380, Baltimore, MD, USA, June.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A joint sequence translation model with integrated reordering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1045–1054, Portland, Oregon, USA, June.

Nadir Durrani, Alexander Fraser, and Helmut Schmid. 2013a. Model with minimal translation units, but decode with phrases. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1–11, Atlanta, Georgia, June.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013b. Can Markov models over minimal translation units help phrase-based SMT? In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 399–405, Sofia, Bulgaria, August.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the usefulness of generalized word representations in SMT. In COLING, Dublin, Ireland, August.

Yang Feng and Trevor Cohn. 2013. A Markov model of machine translation using non-parametric Bayesian inference. In 51st Annual Meeting of the Association for Computational Linguistics, pages 333–342, Sofia, Bulgaria, August.

Minwei Feng, Jan-Thorsten Peter, and Hermann Ney. 2013. Advancements in reordering models for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics, pages 322–332, Sofia, Bulgaria, August.

Yang Feng, Trevor Cohn, and Xinkai Du. 2014. Factored Markov translation with robust modeling. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 151–159, Ann Arbor, Michigan, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 848–856, Stroudsburg, PA, USA. Association for Computational Linguistics.


Andreas Guta, Joern Wuebker, Miguel Graça, Yunsu Kim, and Hermann Ney. 2015. Extended translation models in phrase-based decoding. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, September. To appear.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria, August.

Yuening Hu, Michael Auli, Qin Gao, and Jianfeng Gao. 2014. Minimum translation modeling with recurrent neural networks. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 20–29, Gothenburg, Sweden, April.

P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of the 2003 Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-03), pages 127–133, Edmonton, Alberta.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantine, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. pages 177–180, Prague, Czech Republic, June.

Hai Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous Space Translation Models with Neural Networks. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39–48, Montreal, Canada, June.

José B. Mariño, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. N-gram-based Machine Translation. Computational Linguistics, 32(4):527–549, December.

R. C. Moore and W. Lewis. 2010. Intelligent Selection of Language Model Training Data. In ACL (Short Papers), pages 220–224, Uppsala, Sweden, July.

Jan Niehues, Teresa Herrmann, Stephan Vogel, and Alex Waibel. 2011. Wider Context by Using Bilingual Language Models in Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 198–206.

Franz J. Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March.

Franz J. Och, Christoph Tillmann, and Hermann Ney. 1999. Improved Alignment Models for Statistical Machine Translation. In Proc. Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20–28, University of Maryland, College Park, MD, June.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, September.

Darlene Stewart, Roland Kuhn, Eric Joanis, and George Foster. 2014. Coarse split and lump bilingual language models for richer source information in SMT. In AMTA, Vancouver, BC, Canada, October.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014. Translation Modeling with Bidirectional Recurrent Neural Networks. In Conference on Empirical Methods in Natural Language Processing, pages 14–25, Doha, Qatar, October.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT-NAACL 2004: Short Papers, HLT-NAACL-Short ’04, pages 101–104, Stroudsburg, PA, USA.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving statistical machine translation with word class models. In Conference on Empirical Methods in Natural Language Processing, pages 1377–1381, Seattle, USA, October.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-Based Statistical Machine Translation. In 25th German Conference on Artificial Intelligence (KI 2002), pages 18–32, Aachen, Germany, September.

Hui Zhang, Kristina Toutanova, Chris Quirk, and Jianfeng Gao. 2013. Beyond left-to-right: Multiple decomposition structures for SMT. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 12–21, Atlanta, Georgia, June.


B Alignment-Based Neural Machine Translation

Alignment-Based Neural Machine Translation

Tamer Alkhouli, Gabriel Bretschner, Jan-Thorsten Peter, Mohammed Hethnawi, Andreas Guta and Hermann Ney
Human Language Technology and Pattern Recognition Group
RWTH Aachen University, Aachen, Germany
{surname}@cs.rwth-aachen.de

Abstract

Neural machine translation (NMT) has emerged recently as a promising statistical machine translation approach. In NMT, neural networks (NN) are directly used to produce translations, without relying on a pre-existing translation framework. In this work, we take a step towards bridging the gap between conventional word alignment models and NMT. We follow the hidden Markov model (HMM) approach that separates the alignment and lexical models. We propose a neural alignment model and combine it with a lexical neural model in a log-linear framework. The models are used in a standalone word-based decoder that explicitly hypothesizes alignments during search. We demonstrate that our system outperforms attention-based NMT on two tasks: IWSLT 2013 German→English and BOLT Chinese→English. We also show promising results for re-aligning the training data using neural models.

1 Introduction

Neural networks have been gaining a lot of attention recently in areas like speech recognition, image recognition and natural language processing. In machine translation, NNs are applied in two main ways: In N-best rescoring, the neural model is used to score the first-pass decoding output, limiting the model to a fixed set of hypotheses (Le et al., 2012; Sundermeyer et al., 2014a; Hu et al., 2014; Guta et al., 2015). The second approach integrates the NN into decoding, potentially allowing it to directly determine the search space.

There are two approaches to use neural models in decoding. The first integrates the models into phrase-based decoding, where the models are used to score phrasal candidates hypothesized by the decoder (Vaswani et al., 2013; Devlin et al., 2014; Alkhouli et al., 2015). The second approach is referred to as neural machine translation, where neural models are used to hypothesize translations, word by word, without relying on a pre-existing framework. In comparison to the former approach, NMT does not restrict NNs to predetermined translation candidates, and it does not depend on word alignment concepts that have been part of building state-of-the-art phrase-based systems. In such systems, the HMM and the IBM models developed more than two decades ago are used to produce Viterbi word alignments, which are used to build standard phrase-based systems. Existing NMT systems either disregard the notion of word alignments entirely (Sutskever et al., 2014), or rely on a probabilistic notion of alignments (Bahdanau et al., 2015) independent of the conventional alignment models.

Most recently, Cohn et al. (2016) designed neural models that incorporate concepts like fertility and Markov conditioning into their structure. In this work, we also focus on the question whether conventional word alignment concepts can be used for NMT. In particular, (1) We follow the HMM approach to separate the alignment and translation models, and use neural networks to model alignments and translation. (2) We introduce a lexicalized alignment model to capture source reordering information. (3) We bootstrap the NN training using Viterbi word alignments obtained from the HMM and IBM model training, and use the trained neural models to generate new alignments. The new alignments are then used to re-train the neural networks. (4) We design an alignment-based decoder that hypothesizes the alignment path along with the associated translation. We show competitive results in comparison to attention-based


models on the IWSLT 2013 German→English and BOLT Chinese→English task.

1.1 Motivation

Attention-based NMT computes the translation probability depending on an intermediate computation of an alignment distribution. The alignment distribution is used to choose the positions in the source sentence that the decoder attends to during translation. Therefore, the alignment model can be considered as an implicit part of the translation model. On the other hand, separating the alignment model from the lexical model has its own advantages: First, this leads to more flexibility in modeling and training: not only can the models be trained separately, but they can also have different model types, e.g. neural models, count-based models, etc. Second, the separation avoids propagating errors from one model to the other. In attention-based systems, the translation score is based on the alignment distribution, which risks propagating errors from the alignment part to the translation part. Third, using separate models makes it possible to assign them different weights. We exploit this and use a log-linear framework to combine them. We still retain the possibility of joint training, which can be performed flexibly by alternating between model training and alignment generation. The latter can be performed using forced-decoding.

In contrast to the count-based models used in HMMs, we use neural models, which allow covering long context without having to explicitly address the smoothing problem that arises in count-based models.

2 Related Work

Most recently, NNs have been trained on large amounts of data, and applied to translate independent of the phrase-based framework. Sutskever et al. (2014) introduced the pure encoder-decoder approach, which avoids the concept of word alignments. Bahdanau et al. (2015) introduced an attention mechanism to the encoder-decoder approach, allowing the decoder to attend to certain source words. This method was refined in (Luong et al., 2015) to allow for local attention, which makes the decoder attend to representations of source words residing within a window. These translation models have shown competitive results, outperforming phrase-based systems when using ensembles on tasks like IWSLT English→German 2015 (Luong and Manning, 2015).

In this work, we follow the same standalone neural translation approach. However, we have a different treatment of alignments. While the attention-based soft-alignment model computes an alignment distribution as an intermediate step within the neural model, we follow the hard alignment concept used in phrase extraction. We separate the alignment model from the lexical model, and train them independently. At translation time, the decoder hypothesizes and scores the alignment path in addition to the translation.

Cohn et al. (2016) introduce several modifications to the attention-based model inspired by traditional word alignment concepts. They modify the network architecture, adding a first-order dependence by making the attention vector computed for a target position directly dependent on that of the previous position. Our alignment model has a first-order dependence that takes place at the input and output of the model, rather than an architectural modification of the neural network.

Yang et al. (2013) use NN-based lexical and alignment models, but they give up the probabilistic interpretation and produce unnormalized scores instead. Furthermore, they model alignments using a simple distortion model that has no dependence on lexical context. The models are used to produce new alignments which are in turn used to train phrase systems. This leads to no significant difference in terms of translation performance. Tamura et al. (2014) propose a lexicalized RNN alignment model. The model still produces non-probabilistic scores, and is used to generate word alignments used to train phrase-based systems. In this work, we develop a feed-forward neural alignment model that computes probabilistic scores, and use it directly in standalone decoding, without constraining it to the phrase-based framework. In addition, we use the neural models to produce alignments that are used to re-train the same neural models.

Schwenk (2012) proposed a feed-forward network that computes phrase scores offline, and the scores were added to the phrase table of a phrase-based system. Offline phrase scoring was also done in (Alkhouli et al., 2014) using semantic phrase features obtained using simple neural networks. In comparison, our work does not rely on the phrase-based system, rather, the neural


networks are used to hypothesize translation candidates directly, and the scores are computed online during decoding.

We use the feed-forward joint model introduced in (Devlin et al., 2014) as a lexical model, and introduce a lexicalized alignment model based on it. In addition, we modify the bidirectional joint model presented in (Sundermeyer et al., 2014a) and compare it to the feed-forward variant. These lexical models were applied in phrase-based systems. In this work, we apply them in a standalone NMT framework.

Forced alignment was applied to train phrase tables in (Wuebker et al., 2010; Peitz et al., 2012). We generate forced alignments using a neural decoder, and use them to re-train neural models.

Tackling the costly normalization of the output layer during decoding has been the focus of several papers (Vaswani et al., 2013; Devlin et al., 2014; Jean et al., 2015). We propose a simple method to speed up decoding using a class-factored output layer with almost no loss in translation quality.

3 Statistical Machine Translation

In statistical machine translation, the target word sequence e_1^I = e_1, ..., e_I of length I is assigned a probability conditioned on the source word sequence f_1^J = f_1, ..., f_J of length J. By introducing word alignments as hidden variables, the posterior probability p(e_1^I | f_1^J) can be computed using a lexical and an alignment model as follows.

p(e_1^I | f_1^J) = \sum_{b_1^I} p(e_1^I, b_1^I | f_1^J)
               = \sum_{b_1^I} \prod_{i=1}^{I} p(e_i, b_i | b_1^{i-1}, e_1^{i-1}, f_1^J)
               = \sum_{b_1^I} \prod_{i=1}^{I} \underbrace{p(e_i | b_1^{i}, e_1^{i-1}, f_1^J)}_{\text{lexical model}} \cdot \underbrace{p(b_i | b_1^{i-1}, e_1^{i-1}, f_1^J)}_{\text{alignment model}}

where b_1^I = b_1, ..., b_I denotes the alignment path, such that b_i aligns the target word e_i to the source word f_{b_i}. In this general formulation, the lexical model predicts the target word e_i conditioned on the source sentence, the target history, and the alignment history. The alignment model is lexicalized using the source and target context as well. The sum over alignment paths is replaced by the maximum during decoding (cf. Section 5).

4 Neural Network Models

There are two common network architectures used in machine translation: feed-forward NNs (FFNN) and recurrent NNs (RNN). In this section we will discuss alignment-based feed-forward and recurrent neural networks. These networks are conditioned on the word alignment, in addition to the source and target words.

4.1 Feed-forward Joint Model

We adopt the feed-forward joint model (FFJM) proposed in (Devlin et al., 2014) as the lexical model. The authors demonstrate the model has a strong performance when applied in a phrase-based framework. In this work we explore its performance in standalone NMT. The model was introduced along with heuristics to resolve unaligned and multiply aligned words. We denote the heuristic-based source alignment point corresponding to the target position i by \hat{b}_i. The model is defined as

p(e_i | b_1^i, e_1^{i-1}, f_1^J) = p(e_i | e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m})    (1)

and it computes the probability of a target word e_i at position i given the n-gram target history e_{i-n}^{i-1} = e_{i-n}, ..., e_{i-1}, and a window of 2m+1 source words f_{\hat{b}_i - m}^{\hat{b}_i + m} = f_{\hat{b}_i - m}, ..., f_{\hat{b}_i + m} centered around the word f_{\hat{b}_i}.

As the heuristics have implications on our alignment-based decoder, we explain them by the examples shown in Figure 1. We mark the source and target context by rectangles on the x- and y-axis, respectively. The left figure shows a single source word ‘Jungen’ aligned to a single target word ‘offspring’, in which case the original source position is used, i.e., \hat{b}_i = b_i. If the target word is aligned to multiple source words, as is the case with the words ‘Mutter Tiere’ and ‘Mothers’ in the middle figure, then \hat{b}_i is set to the middle alignment point. In this example, the left alignment point associated with ‘Mutter’ is selected. The right figure shows the case of the unaligned target word ‘of’. \hat{b}_i is set to the source position associated with the closest aligned target word ‘full’, preferring right to left. Note that this model does not have special handling of unaligned source words. While these words can be covered indirectly by source windows associated with aligned source words, the model does not explicitly score them.
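To make the affiliation heuristics above concrete, the following is a minimal Python sketch of one way to resolve a general word alignment into a single source position per target word: multiply-aligned target words take the middle (left-of-middle) alignment point, and unaligned target words copy the position of the closest aligned neighbour, preferring the right neighbour. The function name, data layout and tie-breaking details are our own assumptions for illustration, not the authors' implementation.

# Sketch of the affiliation heuristic discussed above (in the spirit of
# Devlin et al., 2014): resolve a general word alignment into exactly one
# source position per target position.

def affiliations(num_target, links):
    """links: set of (source_pos, target_pos) alignment points (0-based)."""
    per_target = {}
    for s, t in links:
        per_target.setdefault(t, []).append(s)

    b_hat = [None] * num_target
    # single or multiple links: take the middle alignment point
    for t, sources in per_target.items():
        sources.sort()
        b_hat[t] = sources[(len(sources) - 1) // 2]  # left-of-middle if even

    # unaligned target words: copy from the closest aligned neighbour,
    # preferring the right neighbour over the left one at equal distance
    for t in range(num_target):
        if b_hat[t] is None:
            for dist in range(1, num_target):
                for cand in (t + dist, t - dist):
                    if 0 <= cand < num_target and b_hat[cand] is not None:
                        b_hat[t] = b_hat[cand]
                        break
                if b_hat[t] is not None:
                    break
    return b_hat

# Example: target word 1 has two links, target word 2 is unaligned.
print(affiliations(4, {(0, 0), (1, 1), (2, 1), (3, 3)}))  # -> [0, 1, 3, 3]

Under these assumptions, the decoder and the models below only ever see one resolved position per target word, which is what makes enumerating source positions during search tractable.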


Figure 1: Examples on resolving word alignments to obtain word affiliations.

Computing normalized probabilities is done using the softmax function, which requires computing the full output layer first, and then computing the normalization factor by summing over the output scores of the full vocabulary. This is very costly for large vocabularies. To overcome this, we adopt the class-factored output layer consisting of a class layer and a word layer (Goodman, 2001; Morin and Bengio, 2005). The model in this case is defined as

p(e_i | e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m}) = p(e_i | c(e_i), e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m}) \cdot p(c(e_i) | e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m})

where c denotes a word mapping that assigns each target word to a single class, where the number of classes is chosen to be much smaller than the vocabulary size, |C| << |V|. Even though the full class layer needs to be computed, only a subset of the significantly larger word layer has to be considered, namely the words that share the same class c(e_i) with the target word e_i. This helps speeding up training on large-vocabulary tasks.

4.2 Bidirectional Joint Model

The bidirectional RNN joint model (BJM) presented in (Sundermeyer et al., 2014a) is another lexical model. The BJM uses the full source sentence and the full target history for prediction, and it is computed by reordering the source sentence following the target order. This requires the complete alignment information to compute the model scores. Here, we introduce a variant of the model that is conditioned on the alignment history instead of the full alignment path. This is achieved by computing forward and backward representations of the source sentence in its original order, as done in (Bahdanau et al., 2015). The model is given by

p(e_i | b_1^i, e_1^{i-1}, f_1^J) = p(e_i | \hat{b}_1^i, e_1^{i-1}, f_1^J)

Note that we also use the same alignment heuristics presented in Section 4.1. As this variant does not require future alignment information, it can be applied in decoding. However, in this work we apply this model in rescoring and leave decoder integration to future work.

4.3 Feed-forward Alignment Model

We propose a neural alignment model to score alignment paths. Instead of predicting the absolute positions in the source sentence, we model the jumps from one source position to the next position to be translated. The jump at target position i is defined as \Delta_i = \hat{b}_i - \hat{b}_{i-1}, which captures the jump from the source position \hat{b}_{i-1} to \hat{b}_i. We modify the FFNN lexical model to obtain a feed-forward alignment model. The feed-forward alignment model (FFAM) is given by

p(b_i | b_1^{i-1}, e_1^{i-1}, f_1^J) = p(\Delta_i | e_{i-n}^{i-1}, f_{\hat{b}_{i-1} - m}^{\hat{b}_{i-1} + m})    (2)

This is a lexicalized alignment model conditioned on the n-gram target history and the (2m+1)-gram source window. Note that, different from the FFJM, the source window of this model is centered around the source position \hat{b}_{i-1}. This is because the model needs to predict the jump to the next source position \hat{b}_i to be translated. The alignment model architecture is shown in Figure 2.

In contrast to the lexical model, the output vocabulary of the alignment model is much smaller, and therefore we use a regular softmax output layer for this model without class-factorization.

4.4 Feed-forward vs. Recurrent Models

RNNs have been shown to outperform feed-forward variants in language and translation modeling.
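As a small illustration of the jump targets the FFAM predicts, the sketch below maps a jump \Delta_i = \hat{b}_i - \hat{b}_{i-1} to a non-negative output class index and back; with a maximum jump length of 100 in either direction this gives the 201-node output layer reported in the experiments (Section 7). The clipping of out-of-range jumps and the helper names are assumptions made for illustration, not the authors' implementation.

# Minimal sketch: encode FFAM jumps as output classes. A jump of +1 is the
# monotone step; rare jumps beyond +/- max_jump are clipped here, which is
# an assumption rather than the paper's documented behaviour.

MAX_JUMP = 100

def jump_to_class(b_prev, b_cur, max_jump=MAX_JUMP):
    delta = b_cur - b_prev
    delta = max(-max_jump, min(max_jump, delta))   # clip long jumps
    return delta + max_jump                        # class index in [0, 200]

def class_to_jump(cls, max_jump=MAX_JUMP):
    return cls - max_jump

if __name__ == "__main__":
    assert jump_to_class(5, 6) == MAX_JUMP + 1     # monotone step -> +1
    assert class_to_jump(jump_to_class(10, 3)) == -7
    print("output layer size:", 2 * MAX_JUMP + 1)  # -> 201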


[Figure 2: network diagram, omitted. Output: p(\Delta_i | e_{i-3}^{i-1}, f_{\hat{b}_{i-1} - 2}^{\hat{b}_{i-1} + 2}); inputs: f_{\hat{b}_{i-1} - 2}^{\hat{b}_{i-1} + 2} and e_{i-3}^{i-1}.]

Figure 2: A feed-forward alignment NN, with 3 target history words, 5-gram source window, a projection layer, 2 hidden layers, and a small output layer to predict jumps.

Nevertheless, feed-forward networks have their own advantages: First, they are typically faster to train due to their simple architecture, and second, they are more flexible to integrate into beam search decoders. This is because feed-forward networks only depend on a limited context. RNNs, on the other hand, are conditioned on an unbounded context. This means that the complete hypotheses during decoding have to be maintained without any state recombination. Since feed-forward networks allow the use of state recombination, they are potentially capable of exploring more candidates during beam search.

5 Alignment-based Decoder

In this section we present the alignment-based decoder. This is a beam-search word-based decoder that predicts one target word at a time. As the models we use are alignment-based, the decoder hypothesizes the alignment path. This is different from the NMT approaches present in the literature, which are based on models that either ignore word alignments or compute alignments as part of the attention-based model.

In the general case, a word can be aligned to a single word, multiple words, or it can be unaligned. However, we do not use the general word alignment notion, rather, the models are based on alignments derived using the heuristics discussed in Section 4. These heuristics simplify the task of the decoder, as they induce equivalence classes over the alignment paths, reducing the number of possible alignments the decoder has to hypothesize significantly. As a result of using these heuristics, the task of hypothesizing alignments is reduced to enumerating all J source positions a target word can be aligned to. The following is a list of the possible alignment scenarios and how the decoder covers them.

• Multiply-aligned target words: the heuristic chooses the middle link as an alignment point. Therefore, the decoder is able to cover these cases by hypothesizing J many source positions for each target word hypothesis.

• Unaligned target words: the heuristic aligns these words using the nearest aligned target word in training (cf. Figure 1, right). In decoding, these words are handled as aligned words.

• Multiply-aligned source words: covered by revisiting a source position that has already been translated.

• Unaligned source words: result if no target word is generated using a source window centered around the source word in question.

Algorithm 1 Alignment-based Decoder
 1: procedure TRANSLATE(f_1^J, beamSize)
 2:   hyps ← initHyp                ▷ previous set of partial hypotheses
 3:   newHyps ← ∅                   ▷ current set of partial hypotheses
 4:   while GETBEST(hyps) not terminated do
 5:     ▷ compute alignment distribution in batch mode
 6:     alignDists ← ALIGNMENTDISTRIBUTION(hyps)
 7:     ▷ hypothesize source alignment points
 8:     for pos from 1 to J do
 9:       ▷ compute lexical distributions of all
10:       ▷ hypotheses in hyps in batch mode
11:       dists ← LEXICALDISTRIBUTION(hyps, pos)
12:       ▷ expand each of the previous hypotheses
13:       for hyp in hyps do
14:         jmpCost ← SCORE(alignDists, hyp, pos)
15:         dist ← GETDISTRIBUTION(dists, hyp)
16:         dist ← PARTIALSORT(dist, beamSize)
17:         cnt ← 0
18:         ▷ hypothesize new target word
19:         for word in dist do
20:           if cnt > beamSize then
21:             break
22:           newHyp ← EXTEND(hyp, word, pos, jmpCost)
23:           newHyps.INSERT(newHyp)
24:           cnt ← cnt + 1
25:     PRUNE(newHyps, beamSize)
26:     hyps ← newHyps
27:
28:   ▷ return the best scoring hypothesis
29:   return GETBEST(hyps)

The decoder is shown in Algorithm 1. It involves hypothesizing alignments and translation words.
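The PARTIALSORT step in line 16 of Algorithm 1 can be realised with a selection-based partition instead of a full sort. A minimal numpy sketch is given below, under the assumption that the lexical distribution is available as a dense vector; numpy's argpartition runs in linear time on average, which matches the complexity property this step relies on.

# Select the beamSize best-scoring words without sorting the full vocabulary.
import numpy as np

def best_words(dist, beam_size):
    """Indices of the beam_size highest-probability words (unordered among
    themselves), obtained by partial partitioning rather than a full sort."""
    return np.argpartition(-dist, beam_size)[:beam_size]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dist = rng.random(50000)          # stand-in for a lexical distribution
    beam = best_words(dist, 16)
    # every selected word scores at least as well as every discarded word
    assert dist[beam].min() >= np.delete(dist, beam).max()
    print(beam)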


Alignments are hypothesized in the loop starting at line 8. Once an alignment point is set to position pos, the lexical distribution over the full target vocabulary is computed using this position in line 11. The distribution is sorted and the best candidate translations lying within the beam are used to expand the partial hypotheses.

We batch the NN computations, calling the alignment and lexical networks for all partial hypotheses in a single call to speed up computations, as shown in lines 6 and 11. We also exploit the beam and apply partial sorting in line 16, instead of completely sorting the list. Partial sorting has linear complexity on average, and it returns a list whose first beamSize words have better scores compared to the rest of the list.

We terminate translation if the best scoring partial hypothesis ends with the sentence end symbol. If a hypothesis terminates but it scores worse than other hypotheses, it is removed from the beam, but it still competes with non-terminated hypotheses. Note that we do not have any explicit coverage constraints. This means that a source position can be revisited many times, hence generating one-to-many alignment cases. This also allows having unaligned source words.

In the alignment-based decoder, an alignment distribution is computed, and word alignments are hypothesized and scored using this distribution, leading alignment decisions to become part of beam search. The search space is composed of both alignment and translation decisions. In contrast, the search space in attention-based decoding is composed of translation decisions only.

Class-Factored Output Layer in Decoding

The large output layer used in language and translation modeling is a major bottleneck in evaluating the network. Several papers discuss how to evaluate it efficiently during decoding using approximations. In this work, we exploit the class-factored output layer to speed up training. At decoding time, the network needs to hypothesize all target words, which means the full output layer should be evaluated. In the case of using a class-factored output layer, this results in an additional computational overhead from computing the class layer. In order to speed up decoding, we propose to use the class layer to choose the top scoring k classes, then we evaluate the word layer for each of these classes only. We show this leads to a significant speed-up with minimal loss in translation quality.

Model Combination

We embed the models in a log-linear framework, which is commonly used in phrase-based systems. The goal of the decoder is to find the best scoring hypothesis as follows.

\hat{e}_1^{\hat{I}} = \arg\max_{I, e_1^I} \max_{\hat{b}_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I, \hat{b}_1^I) \right\}

where \lambda_m is the model weight associated with the model h_m, and M is the total number of models. The model weights are automatically tuned using minimum error rate training (MERT) (Och, 2003). Our main system includes a lexical neural model, an alignment neural model, and a word penalty, which is the count of target words. The word penalty becomes important at the end of translation, where hypotheses in the beam might have different final lengths.

6 Forced-Alignment Training

Since the models we use require alignments for training, we initially use the word alignments produced by the HMM/IBM models of GIZA++ as initial alignments. At first, the FFJM and the FFAM are trained separately until convergence, then the models are used to generate new word alignments by force-decoding the training data as follows.

\tilde{b}_1^I(f_1^J, e_1^I) = \arg\max_{b_1^I} \prod_{i=1}^{I} p_{\lambda_1}(\Delta_i | e_{i-n}^{i-1}, f_{b_{i-1} - m}^{b_{i-1} + m}) \cdot p_{\lambda_2}(e_i | e_{i-n}^{i-1}, f_{b_i - m}^{b_i + m})

where \lambda_1 and \lambda_2 are the model weights. We modify the decoder to only compute the probabilities of the target words in the reference sentence. The for loop in line 19 of Algorithm 1 collapses to a single iteration. We use both the feed-forward joint model (FFJM) and the feed-forward alignment model (FFAM) to perform force-decoding, and the new alignments are used to retrain the models, replacing the initial GIZA++ alignments.

Retraining the neural models using the forced alignments has two benefits. First, since the alignments are produced using both the lexical and alignment models, this can be viewed as joint training of the two models. Second, since the neural decoder generates these alignments, training neural models based on them yields models that are more consistent with the neural decoder. We verify this claim in the experiments section.
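The following is a small, self-contained sketch of the two ideas above: a log-linear combination of model scores, and the forced-alignment restriction of Section 6, here reduced to a greedy position-by-position search instead of the beam-search decoder actually used. The model interfaces, the toy stand-in models and the greedy simplification are all assumptions made for illustration.

# Sketch: log-linear scoring plus a greedy variant of forced alignment. When
# force-aligning a training pair, the loop over target words collapses to the
# single reference word, and only the alignment point is searched.

def hypothesis_score(log_probs, weights):
    """Log-linear score: weighted sum of model log-probabilities."""
    return sum(w * lp for w, lp in zip(weights, log_probs))

def force_align(src, ref, lexical_logprob, jump_logprob, lambdas=(0.3, 0.7)):
    """Pick, for each reference word, the source position maximising the
    weighted sum of alignment and lexical log-probabilities (greedy sketch)."""
    alignment, prev = [], -1
    for i, word in enumerate(ref):
        best = max(
            range(len(src)),
            key=lambda b: hypothesis_score(
                (jump_logprob(b, prev, ref, i, src),
                 lexical_logprob(word, b, ref, i, src)),
                lambdas))
        alignment.append(best)
        prev = best
    return alignment

# toy stand-in models: prefer short jumps and identical words
jump = lambda b, prev, ref, i, src: -abs(b - (prev + 1))
lex = lambda w, b, ref, i, src: 0.0 if src[b] == w else -5.0
print(force_align(["a", "b", "c"], ["a", "c", "b"], lex, jump))  # -> [0, 2, 1]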


                    IWSLT            BOLT
                    De       En      Zh       En
Sentences           4.32M            4.08M
Run. Words          108M     109M    78M      86M
Vocab.              836K     792K    384K     817K
FFNN/BJM Vocab.     173K     149K    169K     128K
Attention Vocab.    30K      30K     30K      30K
FFJM params         177M             159M
BJM params          170M             153M
FFAM params         101M             94M
Attention params    84M              84M

Table 1: Corpora and NN statistics.

7 Experiments

We carry out experiments on two tasks: the IWSLT 2013 German→English shared translation task (http://www.iwslt2013.org), and the BOLT Chinese→English task. The corpora statistics are shown in Table 1. The IWSLT phrase-based baseline system is trained on all available bilingual data, and uses a 4-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998), trained with the SRILM toolkit (Stolcke, 2002). As additional data sources for the LM, we selected parts of the Shuffled News and LDC English Gigaword corpora based on the cross-entropy difference (Moore and Lewis, 2010), resulting in a total of 1.7 billion running words for LM training. The phrase-based baseline is a standard phrase-based SMT system (Koehn et al., 2003) tuned with MERT (Och, 2003) and contains a hierarchical reordering model (Galley and Manning, 2008). The in-domain data consists of 137K sentences.

The BOLT Chinese→English task is evaluated on the “discussion forum” domain. We use a 5-gram LM trained on 2.9 billion running words in total. The in-domain data consists of a subset of 67.8K sentences. We used a set of 1845 sentences as a tune set. The evaluation set test1 contains 1844 and test2 contains 1124 sentences.

We use the FFNN architecture for the lexical and alignment models. Both models use a window of 9 source words, and 5 target history words. Both models use two hidden layers, the first has 1000 units and the second has 500 units. The lexical model uses a class-factored output layer, with 1000 singleton classes dedicated to the most frequent words, and 1000 classes shared among the rest of the words. The classes are trained using a separate tool to optimize the maximum likelihood training criterion with the bigram assumption. The alignment model uses a small output layer of 201 nodes, determined by a maximum jump length of 100 (forward and backward). 300 nodes are used for word embeddings. Each of the FFNN models is trained on CPUs using 12 threads, which takes up to 3 days until convergence. We train with stochastic gradient descent using a batch size of 128. The learning rate is halved when the development perplexity increases.

Each BJM has 4 LSTM layers: two for the forward and backward states, one for the target state, and one after merging the source and target states. The size of the word embeddings and hidden layers is 350 nodes. The output layers are identical to those of the FFJM models.

We compare our system to an attention-based baseline similar to the networks described in (Bahdanau et al., 2015). All such systems use single models, rather than ensembles. The word embedding dimension is 620, and each direction of the encoder and the decoder has a layer of 1000 gated recurrent units (Cho et al., 2014). Unknowns and numbers are carried over from the source side to the target side based on the largest attention weight.

To speed up decoding of long sentences, the decoder hypothesizes 21 and 41 source positions around the diagonal, for the IWSLT and the BOLT tasks, respectively. We choose these numbers such that the translation quality does not degrade. The beam size is set to 16 in all experiments. Larger beam sizes did not lead to improvements. We apply part-of-speech-based long-range verb reordering rules to the German side in a preprocessing step for all German→English systems (Popović and Ney, 2006), including the baselines. The Chinese→English systems use no such pre-ordering. We use the GIZA++ word alignments to train the models. The networks are fine-tuned by training additional epochs on the in-domain data only (Luong and Manning, 2015). The LMs are only used in the phrase-based systems in both tasks, but not in the NMT systems.

All translation experiments are performed with the Jane toolkit (Vilar et al., 2010; Wuebker et al., 2012). The alignment-based NNs are trained using an extension of the rwthlm toolkit (Sundermeyer et al., 2014b). We use an implementation based on Blocks (van Merriënboer et al., 2015) and Theano (Bergstra et al., 2010; Bastien et al., 2012) for the attention-based experiments.
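As a side note on the LM data selection mentioned above, here is a minimal sketch of cross-entropy-difference selection in the spirit of Moore and Lewis (2010): each candidate sentence is scored by the difference between its per-word cross-entropy under an in-domain LM and under a general-domain LM, and only the lowest-scoring fraction is kept. The two cross-entropy callables, the function name and the toy stand-ins are assumptions, not the tooling actually used for the experiments.

def moore_lewis_select(sentences, ce_in, ce_out, keep_fraction=0.25):
    # lower (ce_in - ce_out) means "looks in-domain and not just frequent"
    scored = sorted(sentences, key=lambda s: ce_in(s) - ce_out(s))
    return scored[: max(1, int(len(scored) * keep_fraction))]

if __name__ == "__main__":
    # toy stand-ins: pretend the in-domain LM prefers short sentences
    ce_in = lambda s: len(s.split()) * 0.5
    ce_out = lambda s: 3.0
    corpus = ["a very very long general sentence indeed",
              "short in-domain-looking sentence",
              "tiny one"]
    print(moore_lewis_select(corpus, ce_in, ce_out, keep_fraction=0.34))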


All results are measured in case-insensitive BLEU [%] (Papineni et al., 2002) and TER [%] (Snover et al., 2006) on a single reference. We used the multeval toolkit (Clark et al., 2011) for evaluation.

                                 test 2010        eval 2011
#  system                        BLEU   TER       BLEU   TER
1  phrase-based system           28.9   51.0      32.9   46.3
2   + monolingual data           30.4   49.5      35.4   44.2
3  attention-based RNN           27.9   51.4      31.8   46.5
4   + fine-tuning                29.8   48.9      32.9   45.1
5  FFJM+dp+wp                    21.6   56.9      24.7   53.8
6  FFJM+FFAM+wp                  26.1   53.1      29.9   49.4
7   + fine-tuning                29.3   50.5      33.2   46.5
8    + BJM rescoring             30.0   48.7      33.8   44.8
9  BJM+FFAM+wp+fine-tuning       29.8   49.5      33.7   45.8

Table 2: IWSLT 2013 German→English results in BLEU [%] and TER [%].

7.1 IWSLT 2013 German→English

Table 2 shows the IWSLT German→English results. FFJM refers to the feed-forward lexical model. We compare against the phrase-based system with an LM trained on the target side of the bilingual data (row #1), the phrase-based system with an LM trained on additional monolingual data (row #2), the attention-based system (row #3), and the attention-based system after fine-tuning towards the in-domain data (row #4). First, we experiment with a system using the FFJM as a lexical model and a linear distortion penalty (dp) to encourage monotone translation as the alignment model. We also include a word penalty (wp). This system is shown in row #5. In comparison, if the distortion penalty is replaced by the feed-forward alignment model (FFAM), we observe large improvements of 4.5% to 5.2% BLEU (row #5 vs. #6). This highlights the significant role of the alignment model in our system. Moreover, it indicates that the FFAM is able to model alignments beyond the simple monotone alignments preferred by the distortion penalty.

Fine-tuning the neural networks towards in-domain data improves the system by up to 3.3% BLEU and 2.9% TER (row #6 vs. #7). The gain from fine-tuning is larger than the one observed for the attention-based system. This is likely due to the fact that our system has two neural models, and each of them is fine-tuned.

We apply the BJM in 1000-best list rescoring (row #8), which gives another boost, leading our system to outperform the attention-based system by 0.9% BLEU on eval 2011, while a comparable performance is achieved on test 2010.

In order to highlight the difference between using the FFJM and the BJM, we replace the FFJM scores after obtaining the N-best lists with the BJM scores and apply rescoring (row #9). In comparison to row #7, we observe up to 0.5% BLEU and 1.0% TER improvement. This is expected as the BJM captures unbounded source and target context in comparison to the limited context of the FFJM. This calls for a direct integration of the BJM into decoding, which we intend to do in future work. Our best system (row #8) outperforms the phrase-based system (row #1) by up to 1.1% BLEU and 2.3% TER. While the phrase-based system can benefit from training the LM on additional monolingual data (row #1 vs. #2), exploiting monolingual data in NMT systems is still an open research question.

7.2 BOLT Chinese→English

The BOLT Chinese→English experiments are shown in Table 3. Again, we observe large improvements when including the FFAM in comparison to the distortion penalty (row #5 vs. #6), and fine-tuning improves the results considerably. Including the BJM in rescoring improves the system by up to 0.4% BLEU. Our best system (row #8) outperforms the attention-based model by up to 0.4% BLEU and 2.8% TER. We observe that the length ratio of our system’s output to the reference is 93.3-94.9%, while it is 99.1-102.6% for the attention-based system. In light of the BLEU and TER scores, the attention-based model does not benefit from matching the reference length. Our system (row #8) still lags behind the phrase-based system (row #1). Note, however, that in the WMT 2016 evaluation campaign (http://matrix.statmt.org/), it was demonstrated that NMT can outperform phrase-based systems on several tasks including German→English and English→German. Including monolingual data (Sennrich et al., 2016) in training neural translation models can boost performance, and this can be applied to our system.

7.3 Neural Alignments

Next, we experiment with re-aligning the training data using neural networks as described in Section 6. We use the fine-tuned FFJM and FFAM to realign the in-domain data of the IWSLT German→English task. These models are initially


trained using GIZA++ alignments. We train new models using the re-aligned data and compare the translation quality before and after re-alignment. We use 0.7 and 0.3 as model weights for the FFJM and FFAM, respectively. These values are based on the model weights obtained using MERT. The results are shown in Table 4. Note that the baseline is worse than the one in Table 2 as the models are only trained on the in-domain data. We observe that re-aligning the data improves translation quality by up to 0.3% BLEU and 1.2% TER. The new alignments are generated using the neural decoder, and using them to train the neural networks results in training that is more consistent with decoding. As future work, we intend to re-align the full bilingual data and use it for neural training.

                                 test1            test2
#  system                        BLEU   TER       BLEU   TER
1  phrase-based system           17.6   68.3      16.9   67.4
2   + monolingual data           17.9   67.9      17.0   67.1
3  attention-based RNN           14.8   76.1      13.6   76.9
4   + fine-tuning                16.1   73.1      15.4   72.3
5  FFJM+dp+wp                    10.1   77.2       9.8   75.8
6  FFJM+FFAM+wp                  14.4   71.9      13.7   71.3
7   + fine-tuning                15.8   70.3      15.4   69.4
8    + BJM rescoring             16.0   70.3      15.8   69.5
9  BJM+FFAM+wp+fine-tuning       16.0   70.4      15.7   69.7

Table 3: BOLT Chinese→English results in BLEU [%] and TER [%].

                          test 2010        eval 2011
Alignment Source          BLEU   TER       BLEU   TER
GIZA++                    25.6   53.6      29.3   49.7
Neural forced decoding    25.9   52.4      29.5   49.4

Table 4: Re-alignment results in BLEU [%] and TER [%] on the IWSLT 2013 German→English in-domain data. Each system includes FFJM, FFAM and word penalty.

7.4 Class-Factored Output Layer

Figure 3 shows the trade-off between speed and performance when evaluating words belonging to the top classes only. Limiting the evaluation to words belonging to the top class incurs a performance loss of only 0.4% BLEU when compared to the full evaluation of the output layer. However, this corresponds to a large speed-up. The system is about 30 times faster, with a translation speed of 0.4 words/sec. In conclusion, not only does the class layer speed up training, but it can also be used to speed up decoding considerably. We use the top 3 classes throughout our experiments.

[Figure 3: plot, omitted; it shows BLEU [%] and the decoding speed-up factor as a function of the number of top classes evaluated (1 to 1000).]

Figure 3: Decoding speed-up and translation quality using top scoring classes in a class-factored output layer. The results are computed for the IWSLT German→English dev dataset.

8 Analysis

We show an example from the German→English task in Figure 4, along with the alignment path. The reference translation is ‘and the proposal has been to build a lot more coal plants .’. Our system handles the local reordering of the word ‘was’, which is produced in the correct target order. An example on the one-to-many alignments is given by the correct translation of ‘viele’ to ‘a lot of’.

[Figure 4: alignment matrix, omitted; it plots the system output ‘and the proposal was to build a lot of other coal factories .’ against the pre-ordered German source sentence.]

Figure 4: A translation example produced by our system. The shown German sentence is pre-ordered.

As an example on handling multiply-aligned target words, we observe the translation of ‘Nord Westen’ to ‘northwest’ in our output. This is possible because the source window allows the FFNN to translate the word ‘Westen’ in context of the word ‘Nord’.
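Relating Section 7.4 and Figure 3 back to the class-factored output layer of Section 4.1, the sketch below shows how the factorization p(e | h) = p(c(e) | h) * p(e | c(e), h) can be evaluated for only the k best-scoring classes, so that the word layer is computed for a small subset of the vocabulary; words of unexpanded classes are implicitly scored zero, which is why a very small k can cost a little translation quality. The layer shapes, names and toy sizes below are illustrative assumptions, not the actual toolkit interface.

# numpy sketch of the class-factored output layer with a top-k class shortlist.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def topk_class_distribution(h, class_members, W_class, W_word, k=3):
    """Approximate p(e | h) for the words of the k best classes only."""
    p_class = softmax(W_class @ h)
    dist = {}
    for c in np.argsort(-p_class)[:k]:
        members = class_members[int(c)]
        p_word = softmax(W_word[members] @ h)   # normalised within the class
        for w, p in zip(members, p_word):
            dist[w] = float(p_class[c] * p)
    return dist  # words of unexpanded classes are implicitly scored 0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    class_members = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
    W_class = rng.normal(size=(4, dim))
    W_word = rng.normal(size=(8, dim))
    h = rng.normal(size=dim)
    full = topk_class_distribution(h, class_members, W_class, W_word, k=4)
    print("mass covered with k=4:", sum(full.values()))   # ~ 1.0
    print("k=3 shortlist:", topk_class_distribution(h, class_members,
                                                    W_class, W_word, k=3))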


Table 5 lists some translation examples produced by our system and the attention-based system, where maximum attention weights are used as alignment. While we use larger vocabularies compared to the attention-based system, we observe incorrect translations of rare words. E.g., the German word Ölknappheit in sentence 3 occurs only 7 times in the training data among 108M words, and therefore it is an unknown word for the attention system. Our system has the word in the source vocabulary but fails to predict the right translation. Another problem occurs in sentence 4, where the German verb “zurückgehen” is split into “gehen ... zurück”. Since the feed-forward model uses a source window of size 9, it cannot include both words when it is centered at any of them. Such insufficient context might be resolved when integrating the bidirectional RNN in decoding. Note that the attention-based model also fails to produce the correct translation here.

   source        sie würden verhungern nicht , und wissen Sie was ?
1  reference     they wouldn ’t starve , and you know what ?
   attention NMT you wouldn ’t interview , and guess what ?
   our system    they wouldn ’t starve , and you know what ?

   source        denn sie sind diejenigen , die sind auch Experten für Geschmack .
2  reference     because they ’re the ones that are experts in flavor , too .
   attention NMT because they ’re the ones who are also experts .
   our system    because they ’re the ones who are also experts in flavor .

   source        es ist ein Online Spiel , in dem Sie müssen überwinden eine Ölknappheit .
3  reference     this is an online game in which you try to survive an oil shortage .
   attention NMT it ’s an online game where you need to get through a UNKOWN .
   our system    it ’s an online game in which you have to overcome an astrolabe .

   source        es liegt daran , dass gehen nicht Möglichkeiten auf diesem Planeten zurück , sie gehen vorwärts .
4  reference     it ’s because possibilities on this planet , they don ’t go back , they go forward .
   attention NMT it ’s because there ’s no way back on this planet , they ’re going to move forward .
   our system    it ’s because opportunities don ’t go on this planet , they go forward .

Table 5: Sample translations from the IWSLT German→English test set using the attention-based system (Table 2, row #4) and our system (Table 2, row #7). We highlight the (pre-ordered) source words and their aligned target words. We underline the source words of interest, italicize correct translations, and use bold-face for incorrect translations.

9 Conclusion

This work takes a step towards bridging the gap between conventional word alignment concepts and NMT. We use an HMM-inspired factorization of the lexical and alignment models, and employ the Viterbi alignments obtained using conventional HMM/IBM models to train neural models. An alignment-based decoder is introduced and a log-linear framework is used to combine the models. We use MERT to tune the model weights. Our system outperforms the attention-based system on the German→English task by up to 0.9% BLEU, and on Chinese→English by up to 2.8% TER. We also demonstrate that re-aligning the training data using the neural decoder yields better translation quality.

As future work, we aim to integrate alignment-based RNNs such as the BJM into the alignment-based decoder. We also plan to develop a bidirectional RNN alignment model to make jump decisions based on unbounded context. In addition, we want to investigate the use of coverage constraints in alignment-based NMT. Furthermore, we consider the re-alignment experiment promising and plan to apply re-alignment on the full bilingual data of each task.

Acknowledgments

This paper has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21). Tamer Alkhouli was partly funded by the 2016 Google PhD Fellowship for North America, Europe and the Middle East.

References

Tamer Alkhouli, Andreas Guta, and Hermann Ney. 2014. Vector space models for phrase-based machine translation. In EMNLP 2014 Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 1–10, Doha, Qatar, October.

Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. Investigations on phrase-based decoding with recurrent neural network language and translation mod-


els. In Proceedings of the EMNLP 2015 Tenth Work- model. In Proceedings of the Conference on Empiri-
shop on Statistical Machine Translation, pages 294– cal Methods in Natural Language Processing, pages
303, Lisbon, Portugal, September. 848–856, Honolulu, Hawaii, USA, October.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Joshua Goodman. 2001. Classes for fast maximum en-
gio. 2015. Neural machine translation by jointly tropy training. In Proceedings of the IEEE Interna-
learning to align and translate. In International Con- tional Conference on Acoustics, Speech, and Signal
ference on Learning Representations, San Diego, Processing, volume 1, pages 561–564, Utah, USA,
Calefornia, USA, May. May.
Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Andreas Guta, Tamer Alkhouli, Jan-Thorsten Peter, Jo-
James Bergstra, Ian J. Goodfellow, Arnaud Berg- ern Wuebker, and Hermann Ney. 2015. A Com-
eron, Nicolas Bouchard, and Yoshua Bengio. 2012. parison between Count and Neural Network Mod-
Theano: new features and speed improvements. De- els Based on Joint Translation and Reordering Se-
cember. quences. In Conference on Empirical Methods on
James Bergstra, Olivier Breuleux, Frédéric Bastien, Natural Language Processing, pages 1401–1411,
Pascal Lamblin, Razvan Pascanu, Guillaume Des- Lisbon, Portugal, September.
jardins, Joseph Turian, David Warde-Farley, and
Yoshua Bengio. 2010. Theano: a CPU and GPU Yuening Hu, Michael Auli, Qin Gao, and Jianfeng Gao.
math expression compiler. In Proceedings of the 2014. Minimum translation modeling with recur-
Python for Scientific Computing Conference, Austin, rent neural networks. In Proceedings of the 14th
TX, USA, June. Conference of the European Chapter of the Associ-
ation for Computational Linguistics, pages 20–29,
Stanley F. Chen and Joshua Goodman. 1998. An Gothenburg, Sweden, April.
Empirical Study of Smoothing Techniques for Lan-
guage Modeling. Technical Report TR-10-98, Com- Sébastien Jean, Kyunghyun Cho, Roland Memisevic,
puter Science Group, Harvard University, Cam- and Yoshua Bengio. 2015. On using very large tar-
bridge, MA, August. get vocabulary for neural machine translation. In
Proceedings of the Annual Meeting of the Associa-
Kyunghyun Cho, Bart van Merrienboer, Caglar Gul- tion for Computational Linguistics, pages 1–10, Bei-
cehre, Dzmitry Bahdanau, Fethi Bougares, Holger jing, China, July.
Schwenk, and Yoshua Bengio. 2014. Learning
phrase representations using RNN encoder–decoder Reinhard Kneser and Hermann Ney. 1995. Improved
for statistical machine translation. In Proceedings of backing-off for M-gram language modeling. In Pro-
the 2014 Conference on Empirical Methods in Nat- ceedings of the International Conference on Acous-
ural Language Processing, pages 1724–1734, Doha, tics, Speech, and Signal Processing, volume 1, pages
Qatar, October. 181–184, May.
Jonathan H. Clark, Chris Dyer, Alon Lavie, and Philipp Koehn, Franz J. Och, and Daniel Marcu.
Noah A. Smith. 2011. Better hypothesis testing for 2003. Statistical Phrase-Based Translation. In Pro-
statistical machine translation: Controlling for opti- ceedings of the 2003 Meeting of the North Ameri-
mizer instability. In 49th Annual Meeting of the As- can chapter of the Association for Computational
sociation for Computational Linguistics, pages 176– Linguistics, pages 127–133, Edmonton, Canada,
181, Portland, Oregon, June. May/June.
Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vy-
molova, Kaisheng Yao, Chris Dyer, and Gholamreza Hai Son Le, Alexandre Allauzen, and François Yvon.
Haffari. 2016. Incorporating structural alignment 2012. Continuous Space Translation Models with
biases into an attentional neural translation model. Neural Networks. In Conference of the North Amer-
In Proceedings of the 2016 Conference of the North ican Chapter of the Association for Computational
American Chapter of the Association for Computa- Linguistics: Human Language Technologies, pages
tional Linguistics: Human Language Technologies, 39–48, Montreal, Canada, June.
pages 876–885, San Diego, California, June.
Minh-Thang Luong and Christopher D. Manning.
Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas 2015. Stanford neural machine translation systems
Lamar, Richard Schwartz, and John Makhoul. 2014. for spoken language domains. In Proceedings of the
Fast and Robust Neural Network Joint Models for International Workshop on Spoken Language Trans-
Statistical Machine Translation. In 52nd Annual lation, pages 76–79, Da Nag, Vietnam, December.
Meeting of the Association for Computational Lin-
guistics, pages 1370–1380, Baltimore, MD, USA, Minh-Thang Luong, Hieu Pham, and Christopher D.
June. Manning. 2015. Effective approaches to attention-
based neural machine translation. In Conference on
Michel Galley and Christopher D. Manning. 2008. A Empirical Methods in Natural Language Process-
simple and effective hierarchical phrase reordering ing, pages 1412–1421, Lisbon, Portugal, September.

Page 31 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

R.C. Moore and W. Lewis. 2010. Intelligent Selec- Conference on Empirical Methods on Natural Lan-
tion of Language Model Training Data. In Proceed- guage Processing, pages 14–25, Doha, Qatar, Octo-
ings of the 48th Annual Meeting of the Association ber.
for Computational Linguistics, pages 220–224, Up-
psala, Sweden, July. Martin Sundermeyer, Ralf Schlüter, and Hermann Ney.
2014b. rwthlm - the RWTH Aachen university neu-
Frederic Morin and Yoshua Bengio. 2005. Hierarchi- ral network language modeling toolkit. In Inter-
cal probabilistic neural network language model. In speech, pages 2093–2097, Singapore, September.
Proceedings of the international workshop on artifi-
cial intelligence and statistics, pages 246–252, Bar- Ilya Sutskever, Oriol Vinyals, and Quoc V. V Le.
bados, January. 2014. Sequence to sequence learning with neu-
ral networks. In Advances in Neural Information
Franz Josef Och. 2003. Minimum Error Rate Train- Processing Systems 27, pages 3104–3112, Montréal,
ing in Statistical Machine Translation. In Proceed- Canada, December.
ings of the 41th Annual Meeting of the Association
for Computational Linguistics, pages 160–167, Sap- Akihiro Tamura, Taro Watanabe, and Eiichiro Sumita.
poro, Japan, July. 2014. Recurrent neural networks for word align-
ment model. In 52nd Annual Meeting of the Asso-
Kishore Papineni, Salim Roukos, Todd Ward, and Wei- ciation for Computational Linguistics, pages 1470–
Jing Zhu. 2002. Bleu: a Method for Automatic 1480, Baltimore, MD, USA.
Evaluation of Machine Translation. In Proceed-
ings of the 41st Annual Meeting of the Associa- Bart van Merriënboer, Dzmitry Bahdanau, Vincent Du-
tion for Computational Linguistics, pages 311–318, moulin, Dmitriy Serdyuk, David Warde-Farley, Jan
Philadelphia, Pennsylvania, USA, July. Chorowski, and Yoshua Bengio. 2015. Blocks
and fuel: Frameworks for deep learning. CoRR,
Stephan Peitz, Arne Mauser, Joern Wuebker, and Her- abs/1506.00619.
mann Ney. 2012. Forced derivations for hierarchi-
Ashish Vaswani, Yinggong Zhao, Victoria Fossum,
cal machine translation. In International Confer-
and David Chiang. 2013. Decoding with large-
ence on Computational Linguistics, pages 933–942,
scale neural language models improves translation.
Mumbai, India, December.
In Proceedings of the 2013 Conference on Empiri-
Maja Popović and Hermann Ney. 2006. POS-based cal Methods in Natural Language Processing, pages
word reorderings for statistical machine transla- 1387–1392, Seattle, Washington, USA, October.
tion. In Language Resources and Evaluation, pages
David Vilar, Daniel Stein, Matthias Huck, and Her-
1278–1283, Genoa, Italy, May.
mann Ney. 2010. Jane: Open source hierarchi-
Holger Schwenk. 2012. Continuous Space Translation cal translation, extended with reordering and lexi-
Models for Phrase-Based Statistical Machine Trans- con models. In ACL 2010 Joint Fifth Workshop on
lation. In 25th International Conference on Com- Statistical Machine Translation and Metrics MATR,
putational Linguistics, pages 1071–1080, Mumbai, pages 262–270, Uppsala, Sweden, July.
India, December. Joern Wuebker, Arne Mauser, and Hermann Ney.
2010. Training phrase translation models with
Rico Sennrich, Barry Haddow, and Alexandra Birch.
leaving-one-out. In Proceedings of the Annual
2016. Improving neural machine translation models
Meeting of the Association for Computational Lin-
with monolingual data. In Proceedings of 54th An-
guistics, pages 475–484, Uppsala, Sweden, July.
nual Meeting of the Association for Computational
Linguistics, August. Joern Wuebker, Matthias Huck, Stephan Peitz, Malte
Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab
Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin- Mansour, and Hermann Ney. 2012. Jane 2: Open
nea Micciulla, and John Makhoul. 2006. A Study of source phrase-based and hierarchical statistical ma-
Translation Edit Rate with Targeted Human Annota- chine translation. In International Conference on
tion. In Proceedings of the 7th Conference of the As- Computational Linguistics, pages 483–491, Mum-
sociation for Machine Translation in the Americas, bai, India, December.
pages 223–231, Cambridge, Massachusetts, USA,
August. Nan Yang, Shujie Liu, Mu Li, Ming Zhou, and Neng-
hai Yu. 2013. Word alignment modeling with con-
Andreas Stolcke. 2002. SRILM – An Extensible text dependent deep neural network. In 51st Annual
Language Modeling Toolkit. In Proceedings of the Meeting of the Association for Computational Lin-
International Conference on Speech and Language guistics, pages 166–175, Sofia, Bulgaria, August.
Processing, volume 2, pages 901–904, Denver, CO,
September.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker,


and Hermann Ney. 2014a. Translation Modeling
with Bidirectional Recurrent Neural Networks. In

Page 32 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

C The QT21/HimL Combined Machine Translation System

The QT21/HimL Combined Machine Translation System


Jan-Thorsten Peter1 , Tamer Alkhouli1 , Hermann Ney1 , Matthias Huck2 ,
Fabienne Braune2 , Alexander Fraser2 , Aleš Tamchyna2,3 , Ondřej Bojar3 ,
Barry Haddow4 , Rico Sennrich4 , Frédéric Blain5 , Lucia Specia5 ,
Jan Niehues6 , Alex Waibel6 , Alexandre Allauzen7 , Lauriane Aufrant7,8 ,
Franck Burlot7 , Elena Knyazeva7 , Thomas Lavergne7 , François Yvon7 ,
Stella Frank9 , Mārcis Pinnis10
1
RWTH Aachen University, Aachen, Germany
2
LMU Munich, Munich, Germany
3
Charles University in Prague, Prague, Czech Republic
4
University of Edinburgh, Edinburgh, UK
5
University of Sheffield, Sheffield, UK
6
Karlsruhe Institute of Technology, Karlsruhe, Germany
7
LIMSI, CNRS, Université Paris Saclay, Orsay, France
8
DGA, Paris, France
9
ILLC, University of Amsterdam, Amsterdam, The Netherlands
10
Tilde, Riga, Latvia
1
{peter,alkhouli,ney}@cs.rwth-aachen.de
2
{mhuck,braune,fraser}@cis.lmu.de
3
{tamchyna,bojar}@ufal.mff.cuni.cz
4
bhaddow@inf.ed.ac.uk rico.sennrich@ed.ac.uk
5
{f.blain,l.specia}@sheffield.ac.uk
6
{jan.niehues,alex.waibel}@kit.edu
7
{allauzen,aufrant,burlot,knyazeva,lavergne,yvon}@limsi.fr
9
s.c.frank@uva.nl
10
marcis.pinnis@tilde.lv

Abstract of substantially improving statistical and machine


learning based translation models for challenging
This paper describes the joint submis-
languages and low-resource scenarios.
sion of the QT21 and HimL projects for
Health in my Language (HimL) aims to make
the English→Romanian translation task of
public health information available in a wider va-
the ACL 2016 First Conference on Ma-
riety of languages, using fully automatic machine
chine Translation (WMT 2016). The sub-
translation that combines the statistical paradigm
mission is a system combination which
with deep linguistic techniques.
combines twelve different statistical ma-
chine translation systems provided by the In order to achieve high-quality machine trans-
different groups (RWTH Aachen Univer- lation from English into Romanian, members of
sity, LMU Munich, Charles University in the QT21 and HimL projects have jointly built a
Prague, University of Edinburgh, Univer- combined statistical machine translation system.
sity of Sheffield, Karlsruhe Institute of We participated with the QT21/HimL combined
Technology, LIMSI, University of Ams- machine translation system in the WMT 2016
terdam, Tilde). The systems are com- shared task for machine translation of news.1 Core
bined using RWTH’s system combination components of the QT21/HimL combined system
approach. The final submission shows an are twelve individual English→Romanian trans-
improvement of 1.0 B LEU compared to the lation engines which have been set up by differ-
best single system on newstest2016. ent QT21 or HimL project partners. The outputs
of all these individual engines are combined us-
1 Introduction ing the system combination approach as imple-
Quality Translation 21 (QT21) is a European ma- 1
http://www.statmt.org/wmt16/
chine translation research project with the aim translation-task.html

Page 33 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

mented in Jane, RWTH’s open source statistical table is adapted to the SETimes2 corpus (Niehues
machine translation toolkit (Freitag et al., 2014a). and Waibel, 2012). The system uses a pre-
The Jane system combination is a mature imple- reordering technique (Rottmann and Vogel, 2007)
mentation which previously has been successfully in combination with lexical reordering. It uses two
employed in other collaborative projects and for word-based n-gram language models and three ad-
different language pairs (Freitag et al., 2013; Fre- ditional non-word language models. Two of them
itag et al., 2014b; Freitag et al., 2014c). are automatic word class-based (Och, 1999) lan-
In the remainder of the paper, we present the guage models, using 100 and 1,000 word classes.
technical details of the QT21/HimL combined ma- In addition, we use a POS-based language model.
chine translation system and the experimental re- During decoding, we use a discriminative word
sults obtained with it. The paper is structured lexicon (Niehues and Waibel, 2013) as well.
as follows: We describe the common preprocess- We rescore the system output using a 300-best
ing used for most of the individual engines in list. The weights are optimized on the concate-
Section 2. Section 3 covers the characteristics nation of the development data and the SETimes2
of the different individual engines, followed by dev set using the ListNet algorithm (Niehues et al.,
a brief overview of our system combination ap- 2015). In rescoring, we add the source discrimina-
proach (Section 4). We then summarize our empir- tive word lexica (Herrmann et al., 2015) as well as
ical results in Section 5, showing that we achieve neural network language and translation models.
better translation quality than with any individual These models use a factored word representation
engine. Finally, in Section 6, we provide a sta- of the source and the target. On the source side
tistical analysis of certain linguistic phenomena, we use the word surface form and two automatic
specifically the prediction precision on morpho- word classes using 100 and 1,000 classes. On the
logical attributes. We conclude the paper with Romanian side, we add the POS information as an
Section 7. additional word factor.

2 Preprocessing 3.2 LIMSI


The data provided for the task was preprocessed The LIMSI system uses NCODE (Crego et al.,
once, by LIMSI, and shared with all the partici- 2011), which implements the bilingual n-gram ap-
pants, in order to ensure consistency between sys- proach to SMT (Casacuberta and Vidal, 2004;
tems. On the English side, preprocessing con- Crego and Mariño, 2006; Mariño et al., 2006) that
sists of tokenizing and truecasing using the Moses is closely related to the standard phrase-based ap-
toolkit (Koehn et al., 2007). proach (Zens et al., 2002). In this framework,
On the Romanian side, the data is tokenized us- translation is divided into two steps. To trans-
ing LIMSI’s tokro (Allauzen et al., 2016), a rule- late a source sentence into a target sentence, the
based tokenizer that mainly normalizes diacritics source sentence is first reordered according to a
and splits punctuation and clitics. This data is true- set of rewriting rules so as to reproduce the tar-
cased in the same way as the English side. In addi- get word order. This generates a word lattice con-
tion, the Romanian sentences are also tagged, lem- taining the most promising source permutations,
matized, and chunked using the TTL tagger (Tufiş which is then translated. Since the translation step
et al., 2008). is monotonic, this approach is able to rely on the
n-gram assumption to decompose the joint proba-
3 Translation Systems
bility of a sentence pair into a sequence of bilin-
Each group contributed one or more systems. In gual units called tuples.
this section the systems are presented in alphabetic We train three Romanian 4-gram language mod-
order. els, pruning all singletons with KenLM (Heafield,
2011). We use the in-domain monolingual cor-
3.1 KIT pus, the Romanian side of the parallel corpora
The KIT system consists of a phrase-based ma- and a subset of the (out-of-domain) Common
chine translation system using additional models Crawl corpus as training data. We select in-
in rescoring. The phrase-based system is trained domain sentences from the latter using the Moore-
on all available parallel training data. The phrase Lewis (Moore and Lewis, 2010) filtering method,

Page 34 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

more specifically its implementation in XenC training data using the Berkeley parser (Petrov et
(Rousseau, 2013). As a result, one third of the ini- al., 2006). For model prediction during tuning and
tial corpus is removed. Finally, we make a linear decoding, we use parsed versions of the develop-
interpolation of these models, using the SRILM ment and test sets. We train the rule selection
toolkit (Stolcke, 2002). model using VW and tune the weights of the trans-
lation model using batch MIRA (Cherry and Fos-
3.3 LMU-CUNI ter, 2012). The 5-gram language model is trained
The LMU-CUNI contribution is a constrained using KenLM (Heafield et al., 2013) on the Roma-
Moses phrase-based system. It uses a simple fac- nian part of the Common Crawl corpus concate-
tored setting: our phrase table produces not only nated with the Romanian part of the training data.
the target surface form but also its lemma and mor-
phological tag. On the input, we include lemmas, 3.5 RWTH Aachen University: Hierarchical
POS tags and information from dependency parses Phrase-based System
(lemma of the parent node and syntactic relation),
all encoded as additional factors. The RWTH hierarchical setup uses the open
The main difference from a standard phrase- source translation toolkit Jane 2.3 (Vilar et al.,
based setup is the addition of a feature-rich dis- 2010). Hierarchical phrase-based translation
criminative translation model which is condi- (HPBT) (Chiang, 2007) induces a weighted syn-
tioned on both source- and target-side context chronous context-free grammar from parallel text.
(Tamchyna et al., 2016). The motivation for us- In addition to the contiguous lexical phrases, as
ing this model is to better condition lexical choices used in phrase-based translation (PBT), hierar-
by using the source context and to improve mor- chical phrases with up to two gaps are also ex-
phological and topical coherence by modeling the tracted. Our baseline model contains models
(limited left-hand side) target context. with phrase translation probabilities and lexical
We also take advantage of the target factors smoothing probabilities in both translation direc-
by using a 7-gram language model trained on se- tions, word and phrase penalty, and enhanced low
quences of Romanian morphological tags. Finally, frequency features (Chen et al., 2011). It also
our system also uses a standard lexicalized re- contains binary features to distinguish between hi-
ordering model. erarchical and non-hierarchical phrases, the glue
rule, and rules with non-terminals at the bound-
3.4 LMU aries. We use the cube pruning algorithm (Huang
The LMU system integrates a discriminative rule and Chiang, 2007) for decoding.
selection model into a hierarchical SMT system, The system uses three backoff language models
as described in (Tamchyna et al., 2014). The rule (LM) that are estimated with the KenLM toolkit
selection model is implemented using the high- (Heafield et al., 2013) and are integrated into the
speed classifier Vowpal Wabbit2 which is fully in- decoder as separate models in the log-linear com-
tegrated in Moses’ hierarchical decoder. During bination: a full 4-gram LM (trained on all data),
decoding, the rule selection model is called at each a limited 5-gram LM (trained only on in-domain
rule application with syntactic context information data), and a 7-gram word class language model
as feature templates. The features are the same as (wcLM) (Wuebker et al., 2013) trained on all data
used by Braune et al. (2015) in their string-to-tree and with a output vocabulary of 143K words.
system, including both lexical and soft source syn- The system produces 1000-best lists which are
tax features. The translation model features com- reranked using a LSTM-based (Hochreiter and
prise the standard hierarchical features (Chiang, Schmidhuber, 1997; Gers et al., 2000; Gers et al.,
2005) with an additional feature for the rule se- 2003) language model (Sundermeyer et al., 2012)
lection model (Braune et al., 2016). and a LSTM-based bidirectional joined model
Before training, we reduce the number of trans- (BJM) (Sundermeyer et al., 2014a). The mod-
lation rules using significance testing (Johnson et els have a class-factored output layer (Goodman,
al., 2007). To extract the features of the rule se- 2001; Morin and Bengio, 2005) to speed up train-
lection model, we parse the English part of our ing and evaluation. The language model uses 3
2
http://hunch.net/˜vw/ (VW). Implemented by stacked LSTM layers, with 350 nodes each. The
John Langford and many others. BJM has a projection layer, and computes a for-

Page 35 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

ward recurrent state encoding the source and target URLs, e-mail addresses, etc.). During translation
history, a backward recurrent state encoding the a rule-based localisation feature is applied.
source future, and a third LSTM layer to combine
them. All layers have 350 nodes. The neural net- 3.8 Edinburgh/LMU Hierarchical System
works are implemented using an extension of the
The UEDIN-LMU HPBT system is a hierarchi-
RWTHLM toolkit (Sundermeyer et al., 2014b).
cal phrase-based machine translation system (Chi-
The parameter weights are optimized with MERT
ang, 2005) built jointly by the University of Ed-
(Och, 2003) towards the B LEU metric.
inburgh and LMU Munich. The system is based
3.6 RWTH Neural System on the open source Moses implementation of the
hierarchical phrase-based paradigm (Hoang et al.,
The second system provided by the RWTH is an
2009). In addition to a set of standard features in a
attention-based recurrent neural network similar
log-linear combination, a number of non-standard
to (Bahdanau et al., 2015). The implementation
enhancements are employed to achieve improved
is based on Blocks (van Merriënboer et al., 2015)
translation quality.
and Theano (Bergstra et al., 2010; Bastien et al.,
2012). Specifically, we integrate individual language
The network uses the 30K most frequent words models trained over the separate corpora (News
on the source and target side as input vocabulary. Crawl 2015, Europarl, SETimes2) directly into
The decoder and encoder word embeddings are of the log-linear combination of the system and let
size 620. The encoder uses a bidirectional layer MIRA (Cherry and Foster, 2012) optimize their
with 1024 GRUs (Cho et al., 2014) to encode the weights along with all other features in tuning,
source side, while the decoder uses 1024 GRU rather than relying on a single linearly interpolated
layer. language model. We add another background lan-
The network is trained for up to 300K updates guage model estimated over a concatenation of all
with a minibatch size of 80 using Adadelta (Zeiler, Romanian corpora including Common Crawl. All
2012). The network is evaluated every 10000 up- language models are unpruned.
dates on B LEU and the best network on the news- For hierarchical rule extraction, we impose less
dev2016/1 dev set is selected as the final network. strict extraction constraints than the Moses de-
The monolingual News Crawl 2015 corpus is faults. We extract more hierarchical rules by al-
translated into English with a simple phrase-based lowing for a maximum of ten symbols on the
translation system to create additional parallel source side, a maximum span of twenty words,
training data. The new data is weighted by us- and no lower limit to the amount of words cov-
ing the News Crawl 2015 corpus (2.3M sentences) ered by right-hand side non-terminals at extraction
once, the Europarl corpus (0.4M sentences) twice time. We discard rules with non-terminals on their
and the SETimes2 corpus (0.2M sentences) three right-hand side if they are singletons in the train-
times. The final system is an ensemble of 4 net- ing data.
works, all with the same configuration and training In order to promote better reordering decisions,
settings. we implemented a feature in Moses that resem-
bles the phrase orientation model for hierarchical
3.7 Tilde machine translation as described by Huck et al.
The Tilde system is a phrase-based machine trans- (2013) and extend our system with it. The model
lation system built on LetsMT infrastructure (Vasi- scores orientation classes (monotone, swap, dis-
jevs et al., 2012) that features language-specific continuous) for each rule application in decoding.
data filtering and cleaning modules. Tilde’s sys- We finally follow the approach outlined by
tem was trained on all available parallel data. Huck et al. (2011) for lightly-supervised train-
Two language models are trained using KenLM ing of hierarchical systems. We automatically
(Heafield, 2011): 1) a 5-gram model using the translate parts (1.2M sentences) of the monolin-
Europarl and SETimes2 corpora, and 2) a 3-gram gual Romanian News Crawl 2015 corpus to En-
model using the Common Crawl corpus. We also glish with a Romanian→English phrase-based sta-
apply a custom tokenization tool that takes into tistical machine translation system (Williams et
account specifics of the Romanian language and al., 2016). The foreground phrase table extracted
handles non-translatable entities (e.g., file paths, from the human-generated parallel data is filled

Page 36 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

up with entries from a background phrase table built from only News Crawl 2015, with singleton
extracted from the automatically produced News 3-grams and above pruned out. The weights of
Crawl 2015 parallel data. all these features and models are tuned with k-best
Huck et al. (2016) give a more in-depth descrip- MIRA (Cherry and Foster, 2012) on first the half
tion of the Edinburgh/LMU hierarchical machine of newsdev2016. In decoding, we use MBR (Ku-
translation system, along with detailed experimen- mar and Byrne, 2004), cube-pruning (Huang and
tal results. Chiang, 2007) with a pop-limit of 5000, and the
Moses ”monotone at punctuation” switch (to pre-
3.9 Edinburgh Neural System vent reordering across punctuation) (Koehn and
Edinburgh’s neural machine translation system Haddow, 2009).
is an attentional encoder-decoder (Bahdanau et
3.11 USFD Phrase-based System
al., 2015), which we train with nematus.3 We
use byte-pair-encoding (BPE) to achieve open- USFD’s phrase-based system is built using the
vocabulary translation with a fixed vocabulary of Moses toolkit, with MGIZA (Gao and Vogel,
subword symbols (Sennrich et al., 2016c). We 2008) for word alignment and KenLM (Heafield
produce additional parallel training data by auto- et al., 2013) for language model training. We use
matically translating the monolingual Romanian all available parallel data for the translation model.
News Crawl 2015 corpus into English (Sennrich A single 5-gram language model is built using all
et al., 2016b), which we combine with the original the target side of the parallel data and a subpart of
parallel data in a 1-to-1 ratio. We use minibatches the monolingual Romanian corpora selected with
of size 80, a maximum sentence length of 50, word Xenc-v2 (Rousseau, 2013). For the latter we use
embeddings of size 500, and hidden layers of size all the parallel data as in-domain data and the first
1024. We apply dropout to all layers (Gal, 2015), half of newsdev2016 as development set. The fea-
with dropout probability 0.2, and also drop out full ture weights are tuned with MERT (Och, 2003) on
words with probability 0.1. We clip the gradient the first half of newsdev2016.
norm to 1.0 (Pascanu et al., 2013). We train the The system produces distinct 1000-best lists,
models with Adadelta (Zeiler, 2012), reshuffling for which we extend the feature set with the
the training corpus between epochs. We validate 17 baseline black-box features from sentence-
the model every 10 000 minibatches via B LEU on level Quality Estimation (QE) produced with
a validation set, and perform early stopping on Quest++4 (Specia et al., 2015). The 1000-best
B LEU. Decoding is performed with beam search lists are then reranked and the top-best hypothesis
with a beam size of 12. extracted using the nbest rescorer available within
A more detailed description of the system, and the Moses toolkit.
more experimental results, can be found in (Sen-
3.12 UvA
nrich et al., 2016a).
We use a phrase-based machine translation sys-
3.10 Edinburgh Phrase-based System tem (Moses) with a distortion limit of 6 and lex-
Edinburgh’s phrase-based system is built using icalized reordering. Before translation, the En-
the Moses toolkit, with fast align (Dyer et al., glish source side is preordered using the neural
2013) for word alignment, and KenLM (Heafield preordering model of (de Gispert et al., 2015). The
et al., 2013) for language model training. In our preordering model is trained for 30 iterations on
Moses setup, we use hierarchical lexicalized re- the full MGIZA-aligned training data. We use two
ordering (Galley and Manning, 2008), operation language models, built using KenLM. The first is
sequence model (Durrani et al., 2013), domain in- a 5-gram language model trained on all available
dicator features, and binned phrase count features. data. Words in the Common Crawl dataset that ap-
We use all available parallel data for the transla- pear fewer than 500 times were replaced by UNK,
tion model, and all available Romanian text for the and all singleton ngrams of order 3 or higher were
language model. We use two different 5-gram lan- pruned. We also use a 7-gram class-based lan-
guage models; one built from all the monolingual guage model, trained on the same data. 512 word
target text concatenated, without pruning, and one 4
http://www.quest.dcs.shef.ac.uk/
quest_files/features_blackbox_baseline_
3
https://github.com/rsennrich/nematus 17

Page 37 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

the large building


newsdev2016/1 and newsdev2016/2. The first part
was used as development set while the second
the large home
part was our internal test set. Additionally we
a big house extracted 2000 sentences from the Europarl and
a huge house SETimes2 data to create two additional develop-
ment and test sets. Most single systems are op-
Figure 1: System A: the large building; System B: timized for newsdev2016/1 and/or the SETimes2
the large home; System C: a big house; System D: test set. The system combination was optimized
a huge house; Reference: the big house. on the newsdev2016/1 set.
The single system scores in Table 1 show
classes were generated using the method of Green clearly that the UEDIN NMT system is the
et al. (2014). strongest single system by a large margin. The
other standalone attention-based neural network
4 System Combination contribution, RWTH NMT, follows, with only a
small margin before the phrase-based contribu-
System combination produces consensus transla-
tions. The combination of all systems improved
tions from multiple hypotheses which are obtained
the strongest system by another 1.9 B LEU points
from different translation approaches, i.e., the sys-
on our internal test set, newsdev2016/2, and by 1
tems described in the previous section. A system
B LEU point on the official test set, newstest2016.
combination implementation developed at RWTH
Removing the strongest system from our sys-
Aachen University (Freitag et al., 2014a) is used to
tem combination shows a large degradation of the
combine the outputs of the different engines. The
results. The combination is still slightly stronger
consensus translations outperform the individual
then the UEDIN NMT system on newsdev2016/2,
hypotheses in terms of translation quality.
but lags behind on newstest2016. Removing the
The first step in system combination is the gen-
by itself weakest system shows a slight degrada-
eration of confusion networks (CN) from I in-
tion on newsdev2016/2 and newstest2016, hinting
put translation hypotheses. We need pairwise
that it still provides valuable information.
alignments between the input hypotheses, which
Table 2 shows a comparison between all sys-
are obtained from METEOR (Banerjee and Lavie,
tems by scoring the translation output against each
2005). The hypotheses are then reordered to match
other in T ER and B LEU. We see that the neural
a selected skeleton hypothesis in terms of word or-
networks outputs differ the most from all the other
dering. We generate I different CNs, each having
systems.
one of the input systems as the skeleton hypothe-
sis, and the final lattice is the union of all I gen- 6 Morphology Prediction Precision
erated CNs. In Figure 1 an example of a confu-
sion network with I = 4 input translations is de- In order to assess how well the different system
picted. Decoding of a confusion network finds the outputs predict the right morphology, we compute
best path in the network. Each arc is assigned a a precision rate for each Romanian morphologi-
score of a linear model combination of M differ- cal attribute that occurs with nouns, pronouns, ad-
ent models, which includes word penalty, 3-gram jectives, determiners, and verbs (Table 3). For
language model trained on the input hypotheses, a this purpose, we use the METEOR toolkit (Baner-
binary primary system feature that marks the pri- jee and Lavie, 2005) to obtain word alignments
mary hypothesis, and a binary voting feature for between each system translation and the refer-
each system. The binary voting feature for a sys- ence translation for newstest2016. The reference
tem is 1 if and only if the decoded word is from and hypotheses are tagged with TTL (Tufiş et al.,
that system, and 0 otherwise. The different model 2008).5 Each word in the reference that is assigned
weights for system combination are trained with a POS tag of interest (noun, pronoun, adjective,
MERT (Och, 2003). determiner, or verb) is then compared to the word
it is aligned to in the system output. When, for
5 Experimental Evaluation 5
The hypotheses were tagged despite the risks that go
along with tagging automatically generated sentences. A dic-
Since only one development set was provided we tionary would have been a solution, but unfortunately we had
split the given development set into two parts: no such resource for Romanian.

Page 38 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

newsdev2016/1 newsdev2016/2 newstest2016


Individual Systems B LEU T ER B LEU T ER B LEU T ER
KIT 25.2 57.5 29.9 51.8 26.3 55.9
LIMSI 23.3 59.5 27.2 55.0 23.9 59.2
LMU-CUNI 23.4 60.4 28.4 53.5 24.7 58.1
LMU 23.3 60.5 28.6 53.8 24.5 58.5
RWTH HPBT 25.4 58.7 29.3 53.3 25.9 57.6
RWTH NMT 25.1 57.4 30.6 49.6 26.5 55.4
Tilde 21.3 62.7 25.8 56.3 23.2 60.2
UEDIN-LMU HPBT 24.8 58.7 30.1 52.3 25.4 57.7
UEDIN PBT 24.7 59.3 29.1 53.2 25.2 58.1
UEDIN NMT 26.8 56.1 31.4 50.3 27.9 54.5
USFD 22.9 60.4 27.8 54.0 24.4 58.5
UvA 22.1 61.0 27.7 54.2 24.1 58.7
System Combination 28.7 55.5 33.3 49.0 28.9 54.2
- without UEDIN NMT 27.4 56.6 31.6 50.9 27.5 55.4
- without Tilde 28.8 55.5 33.0 49.5 28.7 54.5

Table 1: Results of the individual systems for the English→Romanian task. B LEU [%] and T ER [%]
scores are case-sensitive.
UEDIN-LMU HPBT
RWTH HPBT

UEDIN NMT
RWTH NMT

UEDIN PBT
LMU-CUNI

Average
LIMSI

USFD
LMU

Tilde

UvA
KIT

KIT - 55.0 55.9 51.7 56.2 48.2 50.3 54.6 55.1 42.8 56.6 54.1 52.8
LIMSI 29.3 - 54.3 52.1 51.8 43.0 49.8 55.3 56.2 38.2 57.3 52.1 51.4
LMU-CUNI 28.5 30.8 - 52.4 53.3 43.8 55.4 56.0 56.6 39.3 58.6 56.6 52.9
LMU 31.2 32.0 31.7 - 53.6 43.1 49.1 59.4 58.6 37.8 56.1 55.8 51.8
RWTH HPBT 28.5 32.4 31.2 30.8 - 47.5 50.1 54.9 55.6 41.8 53.9 55.3 52.2
RWTH NMT 33.7 37.9 37.3 37.5 34.8 - 40.8 44.3 45.3 46.0 43.8 43.6 44.5
Tilde 32.2 33.7 29.6 33.8 33.4 39.6 - 53.4 58.5 36.5 55.5 52.0 50.1
UEDIN-LMU HPBT 29.5 29.9 29.4 27.3 29.8 36.9 30.9 - 62.8 38.9 59.6 56.2 54.1
UEDIN PBT 28.4 28.9 28.5 27.0 29.3 35.4 27.0 24.2 - 39.4 60.2 58.6 55.2
UEDIN NMT 38.6 42.6 42.0 43.0 40.1 35.5 44.0 42.1 41.1 - 38.2 38.2 39.7
USFD 27.6 28.8 27.4 28.8 30.4 37.0 29.1 26.5 25.7 42.6 - 58.8 54.4
UvA 29.9 32.0 28.6 29.2 29.6 37.5 31.5 29.0 26.5 43.2 26.9 - 52.9
Average 30.7 32.6 31.4 32.0 31.8 36.6 33.2 30.5 29.3 41.3 30.0 31.3 -

Table 2: Comparison of system outputs against each other, generated by computing B LEU and T ER on
the system translations for newstest2016. One system in a pair is used as the reference, the other as
candidate translation; we report the average over both directions. The upper-right half lists B LEU [%]
scores, the lower-left half T ER [%] scores.

Page 39 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

UEDIN-LMU HPBT
RWTH HPBT

UEDIN NMT
RWTH NMT

Combination
UEDIN PBT
LMU-CUNI
LIMSI

USFD
LMU

Tilde

UvA
KIT

Attribute
Case 46.7% 46.0% 46.3% 45.7% 47.7% 48.0% 44.4% 46.3% 47.4% 49.8% 45.4% 45.4% 50.8%
Definite 50.5% 49.1% 50.0% 49.2% 50.5% 50.1% 47.2% 50.0% 50.5% 51.0% 49.2% 48.9% 53.3%
Gender 51.9% 51.0% 51.9% 51.3% 52.6% 52.1% 49.6% 51.9% 52.7% 53.0% 51.2% 50.9% 54.9%
Number 53.2% 51.7% 52.6% 52.3% 53.6% 53.7% 50.6% 52.9% 53.6% 54.9% 52.1% 51.8% 56.3%
Person 52.8% 51.3% 52.0% 52.0% 53.5% 55.0% 50.6% 52.6% 53.4% 57.2% 52.4% 51.6% 57.1%
Tense 45.8% 44.1% 44.7% 44.8% 45.7% 45.5% 42.3% 45.2% 45.1% 46.6% 44.9% 44.8% 48.0%
Verb form 45.9% 44.4% 45.5% 44.9% 46.6% 47.0% 43.9% 46.1% 46.5% 47.2% 45.5% 43.3% 48.7%
Reference words
57.7% 56.7% 57.3% 57.3% 58.3% 57.6% 55.7% 58.0% 58.5% 58.3% 57.3% 56.8% 60.4%
with alignment

Table 3: Precision of each system on morphological attribute prediction computed over the reference
translation using METEOR alignments. The last row shows the ratio of reference words for which
METEOR managed to find an alignment in the hypothesis.

a given morphological attribute, the output and Acknowledgments


the reference have the same value (e.g. Num-
This project has received funding from the
ber=Singular), we consider the prediction correct.
European Union’s Horizon 2020 research and
The prediction is considered wrong in every other
innovation programme under grant agreements
case.
№ 645452 (QT21) and 644402 (HimL).
The last row in Table 3 shows the ratio of ref-
erence words for which METEOR found an align-
ment in the hypothesis. We observe a high cor- References
relation between this ratio and the quality of the
Alexandre Allauzen, Lauriane Aufrant, Franck Burlot,
morphological predictions, showing that the accu-
Elena Knyazeva, Thomas Lavergne, and François
racy is highly dependent on the alignments. We Yvon. 2016. LIMSI@WMT’16 : Machine transla-
nevertheless observe that the predictions made by tion of news. In Proc. of the ACL 2016 First Conf. on
UEDIN NMT are strictly all better than UEDIN Machine Translation (WMT16), Berlin, Germany,
PBT, although the latter has slightly more align- August.
ments to the reference. The system combination Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-
makes the most accurate predictions for almost ev- gio. 2015. Neural Machine Translation by Jointly
ery attribute. The difference in precision with the Learning to Align and Translate. In Proceedings of
the International Conference on Learning Represen-
best single system (UEDIN NMT) can be signifi-
tations (ICLR).
cant (2.3% for definite and 1.4% for tense) show-
ing that the combination managed to effectively Satanjeev Banerjee and Alon Lavie. 2005. METEOR:
identify the strong points of each translation sys- An Automatic Metric for MT Evaluation with Im-
proved Correlation with Human Judgments. In 43rd
tem. Annual Meeting of the Assoc. for Computational
Linguistics: Proc. Workshop on Intrinsic and Extrin-
7 Conclusion sic Evaluation Measures for MT and/or Summariza-
tion, pages 65–72, Ann Arbor, MI, USA, June.
Our combined effort shows that even with an ex-
tremely strong single best system, we still manage Frédéric Bastien, Pascal Lamblin, Razvan Pascanu,
to improve the final result by one B LEU point by James Bergstra, Ian J. Goodfellow, Arnaud Berg-
combining it with the other systems of all partici- eron, Nicolas Bouchard, and Yoshua Bengio. 2012.
Theano: new features and speed improvements.
pating research groups. Deep Learning and Unsupervised Feature Learning
The joint submission for English→Romanian is NIPS 2012 Workshop.
the best submission measured in terms of B LEU,
as presented on the WMT submission page.6 James Bergstra, Olivier Breuleux, Frédéric Bastien,
Pascal Lamblin, Razvan Pascanu, Guillaume Des-
6
http://matrix.statmt.org/ jardins, Joseph Turian, David Warde-Farley, and

Page 40 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

Yoshua Bengio. 2010. Theano: a CPU and Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne.
GPU math expression compiler. In Proceedings 2015. Fast and accurate preordering for SMT us-
of the Python for Scientific Computing Conference ing neural networks. In Proceedings of the 2015
(SciPy), June. Oral Presentation. Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Fabienne Braune, Nina Seemann, and Alexander Language Technologies, pages 1012–1017, Denver,
Fraser. 2015. Rule Selection with Soft Syn- Colorado, May–June.
tactic Features for String-to-Tree Statistical Ma-
chine Translation. In Proc. of the Conf. on Em- Nadir Durrani, Alexander Fraser, Helmut Schmid,
pirical Methods for Natural Language Processing Hieu Hoang, and Philipp Koehn. 2013. Can
(EMNLP). Markov Models Over Minimal Translation Units
Help Phrase-Based SMT? In Proceedings of the
Fabienne Braune, Alexander Fraser, Hal Daumé III, 51st Annual Meeting of the Association for Compu-
and Aleš Tamchyna. 2016. A Framework for Dis- tational Linguistics (Volume 2: Short Papers), pages
criminative Rule Selection in Hierarchical Moses. 399–405, Sofia, Bulgaria, August.
In Proc. of the ACL 2016 First Conf. on Machine
Chris Dyer, Victor Chahuneau, and Noah A. Smith.
Translation (WMT16), Berlin, Germany, August.
2013. A Simple, Fast, and Effective Reparameter-
ization of IBM Model 2. In Proceedings of NAACL-
Francesco Casacuberta and Enrique Vidal. 2004. Ma-
HLT, pages 644–648, Atlanta, Georgia, June.
chine translation with inferred stochastic finite-state
transducers. Computational Linguistics, 30(3):205– Markus Freitag, Stephan Peitz, Joern Wuebker, Her-
225. mann Ney, Nadir Durrani, Matthias Huck, Philipp
Koehn, Thanh-Le Ha, Jan Niehues, Mohammed
Boxing Chen, Roland Kuhn, George Foster, and Mediani, Teresa Herrmann, Alex Waibel, Nicola
Howard Johnson. 2011. Unpacking and Transform- Bertoldi, Mauro Cettolo, and Marcello Federico.
ing Feature Functions: New Ways to Smooth Phrase 2013. EU-BRIDGE MT: Text Translation of Talks
Tables. In MT Summit XIII, pages 269–275, Xia- in the EU-BRIDGE Project. In Proc. of the
men, China, September. Int. Workshop on Spoken Language Translation
(IWSLT), pages 128–135, Heidelberg, Germany, De-
Colin Cherry and George Foster. 2012. Batch Tun- cember.
ing Strategies for Statistical Machine Translation. In
Proc. of the Conf. of the North American Chapter of Markus Freitag, Matthias Huck, and Hermann Ney.
the Assoc. for Computational Linguistics: Human 2014a. Jane: Open Source Machine Translation
Language Technologies (NAACL-HLT), pages 427– System Combination. In Proc. of the Conf. of the
436, Montréal, Canada, June. European Chapter of the Assoc. for Computational
Linguistics (EACL), pages 29–32, Gothenberg, Swe-
David Chiang. 2005. A Hierarchical Phrase-Based den, April.
Model for Statistical Machine Translation. In Proc.
of the 43rd Annual Meeting of the Association for Markus Freitag, Stephan Peitz, Joern Wuebker, Her-
Computational Linguistics (ACL), pages 263–270, mann Ney, Matthias Huck, Rico Sennrich, Nadir
Ann Arbor, Michigan, June. Durrani, Maria Nadejde, Philip Williams, Philipp
Koehn, Teresa Herrmann, Eunah Cho, and Alex
David Chiang. 2007. Hierarchical Phrase-Based Waibel. 2014b. EU-BRIDGE MT: Combined Ma-
Translation. Computational Linguistics, 33(2):201– chine Translation. In Proc. of the Workshop on
228. Statistical Machine Translation (WMT), pages 105–
113, Baltimore, MD, USA, June.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gul-
Markus Freitag, Joern Wuebker, Stephan Peitz, Her-
cehre, Dzmitry Bahdanau, Fethi Bougares, Hol-
mann Ney, Matthias Huck, Alexandra Birch, Nadir
ger Schwenk, and Yoshua Bengio. 2014. Learn-
Durrani, Philipp Koehn, Mohammed Mediani, Is-
ing Phrase Representations using RNN Encoder–
abel Slawik, Jan Niehues, Eunah Cho, Alex Waibel,
Decoder for Statistical Machine Translation. In Pro-
Nicola Bertoldi, Mauro Cettolo, and Marcello Fed-
ceedings of the 2014 Conference on Empirical Meth-
erico. 2014c. Combined Spoken Language Trans-
ods in Natural Language Processing (EMNLP),
lation. In Proc. of the Int. Workshop on Spoken
pages 1724–1734, Doha, Qatar, October. Associa-
Language Translation (IWSLT), pages 57–64, Lake
tion for Computational Linguistics.
Tahoe, CA, USA, December.
Joseph M. Crego and José B. Mariño. 2006. Improving Yarin Gal. 2015. A Theoretically Grounded Appli-
statistical MT by coupling reordering and decoding. cation of Dropout in Recurrent Neural Networks.
Machine translation, 20(3):199–215, Jul. ArXiv e-prints.
Josep Maria Crego, Franois Yvon, and José B. Mariño. Michel Galley and Christopher D. Manning. 2008. A
2011. N-code: an open-source Bilingual N-gram simple and effective hierarchical phrase reordering
SMT Toolkit. Prague Bulletin of Mathematical Lin- model. In Proceedings of the Conference on Em-
guistics, 96:49–58. pirical Methods in Natural Language Processing,

Page 41 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

pages 848–856, Stroudsburg, PA, USA. Association Proc. of the EMNLP 2011 Workshop on Unsuper-
for Computational Linguistics. vised Learning in NLP, pages 91–96, Edinburgh,
Scotland, UK, July.
Qin Gao and Stephan Vogel. 2008. Parallel implemen-
tations of word alignment tool. In Software Engi- Matthias Huck, Joern Wuebker, Felix Rietig, and Her-
neering, Testing, and Quality Assurance for Natural mann Ney. 2013. A Phrase Orientation Model
Language Processing, pages 49–57. Association for for Hierarchical Machine Translation. In Proc. of
Computational Linguistics. the Workshop on Statistical Machine Translation
(WMT), pages 452–463, Sofia, Bulgaria, August.
Felix A. Gers, Jürgen Schmidhuber, and Fred Cum-
mins. 2000. Learning to forget: Contin- Matthias Huck, Alexander Fraser, and Barry Haddow.
ual prediction with LSTM. Neural computation, 2016. The Edinburgh/LMU Hierarchical Machine
12(10):2451–2471. Translation System for WMT 2016. In Proc. of
the ACL 2016 First Conf. on Machine Translation
Felix A. Gers, Nicol N. Schraudolph, and Jürgen
(WMT16), Berlin, Germany, August.
Schmidhuber. 2003. Learning precise timing with
lstm recurrent networks. The Journal of Machine
Howard Johnson, Joel Martin, George Foster, and
Learning Research, 3:115–143.
Roland Kuhn. 2007. Improving Translation Qual-
Joshua Goodman. 2001. Classes for fast maximum ity by Discarding Most of the Phrasetable. In Proc.
entropy training. CoRR, cs.CL/0108006. of EMNLP-CoNLL 2007.

Spence Green, Daniel Cer, and Christopher D. Man- Philipp Koehn and Barry Haddow. 2009. Edinburgh’s
ning. 2014. An empirical comparison of features Submission to all Tracks of the WMT 2009 Shared
and tuning for phrase-based machine translation. In Task with Reordering and Speed Improvements to
In Procedings of the Ninth Workshop on Statistical Moses. In Proceedings of the Fourth Workshop
Machine Translation. on Statistical Machine Translation, pages 160–164,
Athens, Greece.
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H.
Clark, and Philipp Koehn. 2013. Scalable Modified Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Kneser-Ney Language Model Estimation. pages Callison-Burch, Marcello Federico, Nicola Bertoldi,
690–696, Sofia, Bulgaria, August. Brooke Cowan, Wade Shen, Christine Moran,
Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra
Kenneth Heafield. 2011. KenLM: Faster and Smaller Constantine, and Evan Herbst. 2007. Moses: Open
Language Model Queries. In Proceedings of the Source Toolkit for Statistical Machine Translation.
EMNLP 2011 Sixth Workshop on Statistical Ma- pages 177–180, Prague, Czech Republic, June.
chine Translation, pages 187–197, Edinburgh, Scot-
land, United Kingdom, July. Shankar Kumar and William Byrne. 2004. Minimum
Bayes-Risk Decoding for Statistical Machine Trans-
Teresa Herrmann, Jan Niehues, and Alex Waibel. lation. In HLT 2004 - Human Language Technology
2015. Source Discriminative Word Lexicon for Conference, Boston, MA, May.
Translation Disambiguation. In Proceedings of the
12th International Workshop on Spoken Language José B. Mariño, Rafael E. Banchs, Josep M. Crego,
Translation (IWSLT15), Danang, Vietnam. Adrià de Gispert, Patrik Lambert, José A. R. Fonol-
losa, and Marta R. Costa-jussà. 2006. N-gram-
Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009.
based machine translation. Comput. Linguist.,
A Unified Framework for Phrase-Based, Hierarchi-
32(4):527–549, December.
cal, and Syntax-Based Statistical Machine Transla-
tion. In Proc. of the Int. Workshop on Spoken Lan-
R.C. Moore and W. Lewis. 2010. Intelligent Selection
guage Translation (IWSLT), pages 152–159, Tokyo,
of Language Model Training Data. In ACL (Short
Japan, December.
Papers), pages 220–224, Uppsala, Sweden, July.
Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural computation, Frederic Morin and Yoshua Bengio. 2005. Hierarchi-
9(8):1735–1780. cal probabilistic neural network language model. In
Robert G. Cowell and Zoubin Ghahramani, editors,
Liang Huang and David Chiang. 2007. Forest Rescor- Proceedings of the Tenth International Workshop on
ing: Faster Decoding with Integrated Language Artificial Intelligence and Statistics, pages 246–252.
Models. In Proceedings of the 45th Annual Meet- Society for Artificial Intelligence and Statistics.
ing of the Association for Computational Linguis-
tics, pages 144–151, Prague, Czech Republic, June. J. Niehues and A. Waibel. 2012. Detailed Analysis of
Different Strategies for Phrase Table Adaptation in
Matthias Huck, David Vilar, Daniel Stein, and Her- SMT. In Proceedings of the 10th Conference of the
mann Ney. 2011. Lightly-Supervised Training for Association for Machine Translation in the Ameri-
Hierarchical Phrase-Based Machine Translation. In cas, San Diego, CA, USA.

Page 42 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

Jan Niehues and Alex Waibel. 2013. An MT Error- Lucia Specia, Gustavo Paetzold, and Carolina Scar-
Driven Discriminative Word Lexicon using Sen- ton. 2015. Multi-level translation quality prediction
tence Structure Features. In Proceedings of the 8th with quest++. In Proceedings of ACL-IJCNLP 2015
Workshop on Statistical Machine Translation, Sofia, System Demonstrations, pages 115–120, Beijing,
Bulgaria. China, July. Association for Computational Linguis-
tics and The Asian Federation of Natural Language
Jan Niehues, Quoc Khanh Do, Alexandre Allauzen, Processing.
and Alex Waibel. 2015. Listnet-based MT Rescor-
ing. EMNLP 2015, page 248. Andreas Stolcke. 2002. SRILM – An Extensible Lan-
guage Modeling Toolkit. In Proc. of the Int. Conf.
Franz Josef Och. 1999. An Efficient Method for Deter- on Speech and Language Processing (ICSLP), vol-
mining Bilingual Word Classes. In Proceedings of ume 2, pages 901–904, Denver, CO, September.
the 9th Conference of the European Chapter of the
Association for Computational Linguistics, Bergen, Martin Sundermeyer, Ralf Schlüter, and Hermann Ney.
Norway. 2012. LSTM Neural Networks for Language Mod-
eling. In Interspeech, Portland, OR, USA, Septem-
Franz Josef Och. 2003. Minimum Error Rate Training
ber.
in Statistical Machine Translation. In Proc. of the
41th Annual Meeting of the Association for Compu-
tational Linguistics (ACL), pages 160–167, Sapporo, Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker,
Japan, July. and Hermann Ney. 2014a. Translation Modeling
with Bidirectional Recurrent Neural Networks. In
Razvan Pascanu, Tomas Mikolov, and Yoshua Ben- Conference on Empirical Methods in Natural Lan-
gio. 2013. On the difficulty of training recurrent guage Processing, pages 14–25, Doha, Qatar, Octo-
neural networks. In Proceedings of the 30th Inter- ber.
national Conference on Machine Learning, ICML
2013, pages 1310–1318, , Atlanta, GA, USA. Martin Sundermeyer, Ralf Schlüter, and Hermann Ney.
2014b. rwthlm - The RWTH Aachen University
Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Neural Network Language Modeling Toolkit . In In-
Klein. 2006. Learning Accurate, Compact, and In- terspeech, pages 2093–2097, Singapore, September.
terpretable Tree Annotation. In Proc. of the 21st
Int. Conf. on Computational Linguistics and the 44th Aleš Tamchyna, Fabienne Braune, Alexander M.
Annual Meeting of the Assoc. for Computational Fraser, Marine Carpuat, Hal Daumé III, and Chris
Linguistics, pages 433–440. Quirk. 2014. Integrating a Discriminative Classi-
fier into Phrase-based and Hierarchical Decoding.
Kay Rottmann and Stephan Vogel. 2007. Word Re- The Prague Bulletin of Mathematical Linguistics
ordering in Statistical Machine Translation with a (PBML), 101:29–42.
POS-Based Distortion Model. In Proceedings of
the 11th International Conference on Theoretical Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, and
and Methodological Issues in Machine Translation, Marcin Junczsys-Dowmunt. 2016. Target-Side
Skövde, Sweden. Context for Discriminative Models in Statistical Ma-
chine Translation. In Proc. of ACL, Berlin, Ger-
Anthony Rousseau. 2013. Xenc: An open-source many, August. Association for Computational Lin-
tool for data selection in natural language process- guistics.
ing. The Prague Bulletin of Mathematical Linguis-
tics, (100):73–82.
Dan Tufiş, Radu Ion, Ru Ceauşu, and Dan Ştefănescu.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2008. RACAI’s Linguistic Web Services. In
2016a. Edinburgh Neural Machine Translation Sys- Proceedings of the Sixth International Language
tems for WMT 16. In Proceedings of the First Con- Resources and Evaluation (LREC’08), Marrakech,
ference on Machine Translation (WMT16), Berlin, Morocco, May. European Language Resources As-
Germany. sociation (ELRA).

Rico Sennrich, Barry Haddow, and Alexandra Birch. Bart van Merriënboer, Dzmitry Bahdanau, Vincent Du-
2016b. Improving Neural Machine Translation moulin, Dmitriy Serdyuk, David Warde-Farley, Jan
Models with Monolingual Data. In Proceedings Chorowski, and Yoshua Bengio. 2015. Blocks
of the 54th Annual Meeting of the Association for and Fuel: Frameworks for deep learning. CoRR,
Computational Linguistics (ACL 2016), Berlin, Ger- abs/1506.00619.
many.
Andrejs Vasijevs, Raivis Skadiš, and Jörg Tiedemann.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2012. LetsMT!: A Cloud-Based Platform for Do-
2016c. Neural Machine Translation of Rare Words It-Yourself Machine Translation. In Min Zhang, ed-
with Subword Units. In Proceedings of the 54th An- itor, Proceedings of the ACL 2012 System Demon-
nual Meeting of the Association for Computational strations, number July, pages 43–48, Jeju Island,
Linguistics (ACL 2016), Berlin, Germany. Korea. Association for Computational Linguistics.

Page 43 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

David Vilar, Daniel Stein, Matthias Huck, and Her-


mann Ney. 2010. Jane: Open source hierarchi-
cal translation, extended with reordering and lexi-
con models. In ACL 2010 Joint Fifth Workshop on
Statistical Machine Translation and Metrics MATR,
pages 262–270, Uppsala, Sweden, July.
Philip Williams, Rico Sennrich, Maria Nădejde,
Matthias Huck, Barry Haddow, and Ondřej Bojar.
2016. Edinburgh’s Statistical Machine Translation
Systems for WMT16. In Proc. of the ACL 2016 First
Conf. on Machine Translation (WMT16), Berlin,
Germany, August.
Joern Wuebker, Stephan Peitz, Felix Rietig, and Her-
mann Ney. 2013. Improving statistical machine
translation with word class models. In Conference
on Empirical Methods in Natural Language Pro-
cessing, pages 1377–1381, Seattle, WA, USA, Oc-
tober.

Matthew D. Zeiler. 2012. ADADELTA: An Adaptive


Learning Rate Method. CoRR, abs/1212.5701.
Richard Zens, Franz Josef Och, and Hermann Ney.
2002. Phrase-Based Statistical Machine Transla-
tion. In 25th German Conf. on Artificial Intelligence
(KI2002), pages 18–32, Aachen, Germany, Septem-
ber. Springer Verlag.

Page 44 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

D Beer for MT evaluation and tuning

BEER 1.1: ILLC UvA submission to metrics and tuning task

Miloš Stanojević Khalil Sima’an


ILLC ILLC
University of Amsterdam University of Amsterdam
m.stanojevic@uva.nl k.simaan@uva.nl

Abstract 2. including syntactic features and

We describe the submissions of ILLC 3. removing the recall bias from BEER .
UvA to the metrics and tuning tasks on
WMT15. Both submissions are based In Section 2 we give a short introduction to
on the BEER evaluation metric origi- BEER after which we move to the innovations for
nally presented on WMT14 (Stanojević this year in Sections 3, 4 and 5. We show the re-
and Sima’an, 2014a). The main changes sults from the metric and tuning tasks in Section 6,
introduced this year are: (i) extending and conclude in Section 7.
the learning-to-rank trained sentence level
2 BEER basics
metric to the corpus level (but still decom-
posable to sentence level), (ii) incorporat- The model underying the BEER metric is flexible
ing syntactic ingredients based on depen- for the integration of an arbitrary number of new
dency trees, and (iii) a technique for find- features and has a training method that is targeted
ing parameters of BEER that avoid “gam- for producing good rankings among systems. Two
ing of the metric” during tuning. other characteristic properties of BEER are its hi-
erarchical reordering component and character n-
1 Introduction grams lexical matching component.
In the 2014 WMT metrics task, BEER turned up as 2.1 Old BEER scoring
the best sentence level evaluation metric on aver-
age over 10 language pairs (Machacek and Bojar, BEER is essentially a linear model with which the
2014). We believe that this was due to: score can be computed in the following way:

1. learning-to-rank - type of training that allows X


score(h, r) = wi × φi (h, r) = w ~
~ ·φ
a large number of features and also training
on the same objective on which the model is i

going to be evaluated : ranking of translations ~ is a feature


where w
~ is a weight vector and φ
2. dense features - character n-grams and skip- vector.
bigrams that are less sparse on the sentence
2.2 Learning-to-rank
level than word n-grams
Since the task on which our model is going to
3. permutation trees - hierarchical decompo- be evaluated is ranking translations it comes natu-
sition of word order based on (Zhang and ral to train the model using learning-to-rank tech-
Gildea, 2007) niques.
Our training data consists of pairs of “good”
A deeper analysis of (2) is presented in (Stano-
and “bad” translations. By using a feature vector
jević and Sima’an, 2014c) and of (3) in (Stanojević ~ good for a good translation and a feature vector
φ
and Sima’an, 2014b).
~ bad for a bad translation then using the following
φ
Here we modify BEER by
equations we can transform the ranking problem
1. incorporating a better scoring function that into a binary classification problem (Herbrich et
give scores that are better scaled al., 1999):

396
Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 396–401,
Lisboa, Portugal, 17-18 September 2015. c 2015 Association for Computational Linguistics.

Page 45 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

that there is no need to do any reordering (every-


thing is in the right place). BEER has features
score(hgood , r) > score(hbad , r) ⇔ that compute the number of different node types
w~ ·φ ~ good > w ~ bad ⇔
~ ·φ and for each different type it assigns a different
~ good − w ~ bad > 0 ⇔ weight. Sometimes there are more than one PET
w
~ ·φ ~ ·φ
for the same permutation. Consider Figure 1b and
w ~
~ · (φgood − φ~ bad ) > 0 1c which are just 2 out of 3 possible PETs for per-
w ~ bad − φ
~ · (φ ~ good ) < 0 mutation h4, 3, 2, 1i. Counting the number of trees
that could be built is also a good indicator of the
~ good − φ
If we look at φ ~ bad as a positive training permutation quality. See (Stanojević and Sima’an,
~ ~ 2014b) for details on using PETs for evaluating
instance and at φbad − φgood as a negative training
word order.
instance, we can train any linear classifier to find
weight the weight vector w ~ that minimizes mis- 3 Corpus level BEER
takes in ranking on the training set.
Our goal here is to create corpus level extension
2.3 Lexical component based on character of BEER that decomposes trivially at the sentence
n-grams level. More concretely we wanted to have a corpus
Lexical scoring of BEER relies heavily on charac- level BEER that would be the average of the sen-
ter n-grams. Precision, Recall and F1-score are tence level BEER of all sentences in the corpus:
used with character n-gram orders from 1 until
P
6. These scores are more smooth on the sentence
si ∈c BEERsent (si )
level than word n-gram matching that is present in BEERcorpus (c) = (1)
|c|
other metrics like BLEU (Papineni et al., 2002) or
METEOR (Michael Denkowski and Alon Lavie, In order to do so it is not suitable to to use pre-
2014). vious scoring function of BEER . The previous
BEER also uses precision, recall and F1-score scoring function (and training method) take care
on word level (but not with word n-grams). only that the better translation gets a higher score
Matching of words is computed over METEOR than the worse translation (on the sentence level).
alignments that use WordNet, paraphrasing and For this kind of corpus level computations we have
stemming to have more accurate alignment. an additional requirement that our sentence level
We also make distinction between function and scores need to be scaled proportional to the trans-
content words. The more precise description of lation quality.
used features and their effectiveness is presented
in (Stanojević and Sima’an, 2014c). 3.1 New BEER scoring function
To make the scores on the sentence level better
2.4 Reordering component based on PETs scaled we transform our linear model into a prob-
The word alignments between system and refer- abilistic linear model – logistic regression with the
ence translation can be simplified and considered following scoring function:
as permutation of words from the reference trans-
lation in the system translation. Previous work 1
by (Isozaki et al., 2010) and (Birch and Osborne, score(h, r) =
1 + e− wi ×φi (h,r)
P
i
2010) used this permutation view of word order
and applied Kendall τ for evaluating its distance There is still a problem with this formulation.
from ideal (monotone) word order. During training, the model is trained on the differ-
BEER goes beyond this skip-gram based eval- ence between two feature vectors φ ~ good − φ
~ bad ,
uation and decomposes permutation into a hierar- while during testing it is applied only to one fea-
chical structure which shows how subparts of per- ture vector φ~ test . φ
~ good − φ
~ bad is usually very
mutation form small groups that can be reordered close to the separating hyperplane, whereas φ ~ test
all together. Figure 1a shows PET for permuta- is usually very far from it. This is not a problem
tion h2, 5, 6, 4, 1, 3i. Ideally the permutation tree for ranking but it presents a problem if we want
will be filled with nodes h1, 2i which would say well scaled scores. Being extremely far from the

397

Page 46 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

h2, 4, 1, 3i h2, 1i
h2, 1i
h2, 1i 1
2 h2, 1i 1 3
h2, 1i h2, 1i h2, 1i 2
h1, 2i 4

5 6 4 3 2 1 4 3
(a) Complex PET (b) Fully inverted PET 1 (c) Fully inverted PET 2

Figure 1: Examples of PETs

separated hyperplane gives extreme scores such as 1. POS bigrams matching


0.9999999999912 and 0.00000000000000213 as
a result which are obviously not well scaled. 2. dependency words bigram matching
Our model was trained to give a probability of 3. arc type matching
the “good” translation being better than the “bad”
translation so we should also use it in that way – 4. valency matching
to estimate the probability of one translation being
For each of these we compute precision, recall
better than the other. But which translation? We
and F1-score.
are given only one translation and we need to com-
It has been shown by other researchers (Popović
pute its score. To avoid this problem we pretend
and Ney, 2009) that POS tags are useful for ab-
that we are computing a probability of the test sen-
stracting away from concrete words and measure
tence being a better translation than the reference
the grammatical aspect of translation (for example
for the given reference. In the ideal case the sys-
it can captures agreement).
tem translation and the reference translation will
Dependency word bigrams (bigrams connected
have the same features which will make logistic
by a dependency arc) are also useful for capturing
regression output probability 0.5 (it is uncertain
long distance dependencies.
about which translation is the better one). To make
Most of the previous metrics that work with de-
the scores between 0 and 1 we multiply this result
pendency trees usually ignore the type of the de-
with 2. The final scoring formula is the following:
pendency that is (un)matched and treat all types
equally (Yu et al., 2014). This is clearly not the
2
score(h, r) = case. Surely subject and complement arcs are
1 + e− wi ×(φi (h,r)−φi (r,r))
P
i
more important than modifier arc. To capture this
4 BEER + Syntax = BEER Treepel we created individual features for precision, re-
The standard version of BEER does not use any call and F1-score matching of each arc type so our
syntactic knowledge. Since the training method system could learn on which arc type to put more
of BEER allows the usage of a large number of weight.
features, it is trivial to integrate new features that All words take some number of arguments (va-
would measure the matching between some syntax lency), and not matching that number of argu-
attributes of system and reference translations. ments is a sign of a, potentially, bad translation.
The syntactic representation we exploit is a de- With this feature we hope to capture the aspect of
pendency tree. The reason for that is that we can not producing the right number of arguments for
easily connect the structure with the lexical con- all words (and especially verbs) in the sentence.
tent and it is fast to compute which can often be This model BEER Treepel contains in total 177
very important for evaluation metrics when they features out of which 45 are from original BEER .
need to evaluate on large data. We used Stanford’s
5 BEER for tuning
dependency parser (Chen and Manning, 2014) be-
cause it gives high accuracy parses in a very short The metrics that perform well on metrics task are
time. very often not good for tuning. This is because
The features we compute on the dependency recall has much more importance for human judg-
trees of the system and its reference translation ment than precision. The metrics that put more
are: weight on recall than precision will be better with

398

Page 47 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

tuning metric BLEU MTR BEER Length System Name TrueSkill Score BLEU
Tuning-Only All
BEER 16.4 28.4 10.2 115.7 BLEU -MIRA- DENSE 0.153 -0.177 12.28
BLEU 18.2 28.1 10.1 103.0 ILLC-U VA 0.108 -0.188 12.05
BLEU -MERT- DENSE 0.087 -0.200 12.11
BEER no bias 18.0 27.7 9.8 99.7
AFRL 0.070 -0.205 12.20
USAAR-T UNA 0.011 -0.220 12.16
Table 1: Tuning results with BEER without bias DCU -0.027 -0.256 11.44
on WMT14 as tuning and WMT13 as test set METEOR-CMU -0.101 -0.286 10.88
BLEU -MIRA- SPARSE -0.150 -0.331 10.84
HKUST -0.150 -0.331 10.99
HKUST-LATE — — 12.20
correlation with human judgment, but when used
for tuning they will create overly long translations. Table 6: Results on Czech-English tuning
This bias for long translation is often resolved
by manually setting the weights of recall and pre-
cision to be equal (Denkowski and Lavie, 2011;
The difference between BEER and
He and Way, 2009).
BEER Treepel are relatively big for de-en,
This problem is even bigger with metrics with
cs-en and ru-en while for fr-en and fi-en the
many features. When we have metric like
difference does not seem to be big.
BEER Treepel which has 117 features it is not
clear how to set weights for each feature manu- The results of WMT15 tuning task is shown in
ally. Also some features might not have easy inter- Table 6. The system tuned with BEER without re-
pretation as precision or recall of something. Our call bias was the best submitted system for Czech-
method for automatic removing of this recall bias, English and only the strong baseline outperformed
which is presented in (Stanojević, 2015), gives it.
very good results that can be seen in Table 1.
Before the automatic adaptation of weights 7 Conclusion
for tuning, tuning with standard BEER produces
translations that are 15% longer than the refer- We have presented ILLC UvA submission to the
ence translations. This behavior is rewarded by shared metric and tuning task. All submissions
metrics that are recall-heavy like METEOR and are centered around BEER evaluation metric. On
BEER and punished by precision heavy metrics the metrics task we kept the good results we had
like BLEU. After automatic adaptation of weights, on sentence level and extended our metric to cor-
tuning with BEER matches the length of reference pus level with high correlation with high human
translation even better than BLEU and achieves judgment without losing the decomposability of
the BLEU score that is very close to tuning with the metric to the sentence level. Integration of syn-
BLEU. This kind of model is disliked by ME- tactic features gave a bit of improvement on some
TEOR and BEER but by just looking at the length language pairs. The removal of recall bias allowed
of the produced translations it is clear which ap- us to go from overly long translations produced
proach is preferred. in tuning to translations that match reference rel-
atively close by length and won the 3rd place in
6 Metric and Tuning task results the tuning task. BEER is available at https:
//github.com/stanojevic/beer.
The results of WMT15 metric task of best per-
forming metrics is shown in Tables 2 and 3 for the
system level and Tables 4 and 5 for segment level. Acknowledgments
On the sentence level for out of English lan-
guage pairs on average BEER was the best met- This work is supported by STW grant nr. 12271
ric (same as the last year). Into English it got 2nd and NWO VICI grant nr. 277-89-002. QT21
place with its syntactic version and 4th place as the project support to the second author is also
original BEER . acknowledged (European Unions Horizon 2020
On the corpus level BEER is on average second grant agreement no. 64545). We are thankful to
for out of English language pairs and 6th for into Christos Louizos for help with incorporating a de-
English. BEER and BEER Treepel are the best for pendency parser to BEER Treepel.
en-ru and fi-en.

399

Page 48 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

Correlation coefficient Pearson Correlation Coefficient


Direction fr-en fi-en de-en cs-en ru-en Average
DPMF COMB .995 ± .006 .951 ± .013 .949 ± .016 .992 ± .004 .871 ± .025 .952 ± .013
RATATOUILLE .989 ± .010 .899 ± .019 .942 ± .018 .963 ± .008 .941 ± .018 .947 ± .014
DPMF .997 ± .005 .939 ± .015 .929 ± .019 .986 ± .005 .868 ± .026 .944 ± .014
METEOR-WSD .982 ± .011 .944 ± .014 .914 ± .021 .981 ± .006 .857 ± .026 .936 ± .016
CHR F3 .979 ± .012 .893 ± .020 .921 ± .020 .969 ± .007 .915 ± .023 .935 ± .016
BEER T REEPEL .981 ± .011 .957 ± .013 .905 ± .021 .985 ± .005 .846 ± .027 .935 ± .016
BEER .979 ± .012 .952 ± .013 .903 ± .022 .975 ± .006 .848 ± .027 .931 ± .016
CHR F .997 ± .005 .942 ± .015 .884 ± .024 .982 ± .006 .830 ± .029 .927 ± .016
L E BLEU- OPTIMIZED .989 ± .009 .895 ± .020 .856 ± .025 .970 ± .007 .918 ± .023 .925 ± .017
L E BLEU- DEFAULT .960 ± .015 .895 ± .020 .856 ± .025 .946 ± .010 .912 ± .022 .914 ± .018

Table 2: System-level correlations of automatic evaluation metrics and the official WMT human scores
when translating into English.

Correlation coefficient Pearson Correlation Coefficient


Metric en-fr en-fi en-de en-cs en-ru Average
CHR F3 .949 ± .021 .813 ± .025 .784 ± .028 .976 ± .004 .913 ± .011 .887 ± .018
BEER .970 ± .016 .729 ± .030 .811 ± .026 .951 ± .005 .942 ± .009 .880 ± .017
L E BLEU- OPTIMIZED .949 ± .020 .727 ± .030 .896 ± .020 .944 ± .005 .867 ± .013 .877 ± .018
L E BLEU- DEFAULT .949 ± .020 .760 ± .028 .827 ± .025 .946 ± .005 .849 ± .014 .866 ± .018
RATATOUILLE .962 ± .017 .675 ± .031 .777 ± .028 .953 ± .005 .869 ± .013 .847 ± .019
CHR F .949 ± .021 .771 ± .027 .572 ± .037 .968 ± .004 .871 ± .013 .826 ± .020
METEOR-WSD .961 ± .018 .663 ± .032 .495 ± .039 .941 ± .005 .839 ± .014 .780 ± .022
BS −.977 ± .014 .334 ± .039 −.615 ± .036 −.947 ± .005 −.791 ± .016 −.600 ± .022
DPMF .973 ± .015 n/a .584 ± .037 n/a n/a .778 ± .026

Table 3: System-level correlations of automatic evaluation metrics and the official WMT human scores
when translating out of English.

Direction fr-en fi-en de-en cs-en ru-en Average


DPMF COMB .367 ± .015 .406 ± .015 .424 ± .015 .465 ± .012 .358 ± .014 .404 ± .014
BEER T REEPEL .358 ± .015 .399 ± .015 .386 ± .016 .435 ± .013 .352 ± .013 .386 ± .014
RATATOUILLE .367 ± .015 .384 ± .015 .380 ± .015 .442 ± .013 .336 ± .014 .382 ± .014
BEER .359 ± .015 .392 ± .015 .376 ± .015 .417 ± .013 .336 ± .013 .376 ± .014
METEOR-WSD .347 ± .015 .376 ± .015 .360 ± .015 .416 ± .013 .331 ± .014 .366 ± .014
CHR F .350 ± .015 .378 ± .015 .366 ± .016 .407 ± .013 .322 ± .014 .365 ± .014
DPMF .344 ± .014 .368 ± .015 .363 ± .015 .413 ± .013 .320 ± .014 .362 ± .014
CHR F3 .345 ± .014 .361 ± .016 .360 ± .015 .409 ± .012 .317 ± .014 .359 ± .014
L E BLEU- OPTIMIZED .349 ± .015 .346 ± .015 .346 ± .014 .400 ± .013 .316 ± .015 .351 ± .014
L E BLEU- DEFAULT .343 ± .015 .342 ± .015 .341 ± .014 .394 ± .013 .317 ± .014 .347 ± .014
TOTAL -BS −.305 ± .013 −.277 ± .015 −.287 ± .014 −.357 ± .013 −.263 ± .014 −.298 ± .014

Table 4: Segment-level Kendall’s τ correlations of automatic evaluation metrics and the official WMT
human judgments when translating into English. The last three columns contain average Kendall’s τ
computed by other variants.

Direction en-fr en-fi en-de en-cs en-ru Average


BEER .323 ± .013 .361 ± .013 .355 ± .011 .410 ± .008 .415 ± .012 .373 ± .011
CHR F3 .309 ± .013 .357 ± .013 .345 ± .011 .408 ± .008 .398 ± .012 .363 ± .012
RATATOUILLE .340 ± .013 .300 ± .014 .337 ± .011 .406 ± .008 .408 ± .012 .358 ± .012
L E BLEU- DEFAULT .321 ± .013 .354 ± .013 .345 ± .011 .385 ± .008 .386 ± .012 .358 ± .011
L E BLEU- OPTIMIZED .325 ± .013 .344 ± .012 .345 ± .012 .383 ± .008 .385 ± .012 .356 ± .011
CHR F .317 ± .013 .346 ± .012 .315 ± .013 .407 ± .008 .387 ± .012 .355 ± .012
METEOR-WSD .316 ± .013 .270 ± .013 .287 ± .012 .363 ± .008 .373 ± .012 .322 ± .012
TOTAL -BS −.269 ± .013 −.205 ± .012 −.231 ± .011 −.324 ± .008 −.332 ± .012 −.273 ± .011
DPMF .308 ± .013 n/a .289 ± .012 n/a n/a .298 ± .013
PARMESAN n/a n/a n/a .089 ± .006 n/a .089 ± .006

Table 5: Segment-level Kendall’s τ correlations of automatic evaluation metrics and the official WMT
human judgments when translating out of English. The last three columns contain average Kendall’s τ
computed by other variants.

400

Page 49 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

References Miloš Stanojević and Khalil Sima’an. 2014a. BEER:


BEtter Evaluation as Ranking. In Proceedings of the
Alexandra Birch and Miles Osborne. 2010. LRscore Ninth Workshop on Statistical Machine Translation,
for Evaluating Lexical and Reordering Quality in pages 414–419, Baltimore, Maryland, USA, June.
MT. In Proceedings of the Joint Fifth Workshop on Association for Computational Linguistics.
Statistical Machine Translation and MetricsMATR,
pages 327–332, Uppsala, Sweden, July. Association Miloš Stanojević and Khalil Sima’an. 2014b. Eval-
for Computational Linguistics. uating Word Order Recursively over Permutation-
Danqi Chen and Christopher D Manning. 2014. A fast Forests. In Proceedings of SSST-8, Eighth Work-
and accurate dependency parser using neural net- shop on Syntax, Semantics and Structure in Statis-
works. In Empirical Methods in Natural Language tical Translation, pages 138–147, Doha, Qatar, Oc-
Processing (EMNLP). tober. Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2011. Meteor Miloš Stanojević and Khalil Sima’an. 2014c. Fitting
1.3: Automatic metric for reliable optimization and Sentence Level Translation Evaluation with Many
evaluation of machine translation systems. In Pro- Dense Features. In Proceedings of the 2014 Con-
ceedings of the Sixth Workshop on Statistical Ma- ference on Empirical Methods in Natural Language
chine Translation, WMT ’11, pages 85–91, Strouds- Processing (EMNLP), pages 202–206, Doha, Qatar,
burg, PA, USA. Association for Computational Lin- October. Association for Computational Linguistics.
guistics.
Miloš Stanojević. 2015. Removing Biases from Train-
Y. He and A. Way. 2009. Improving the objective func- able MT Metrics by Using Self-Training. arXiv
tion in minimum error rate training. Proceedings preprint arXiv:1508.02445.
of the Twelfth Machine Translation Summit, pages
238–245. Hui Yu, Xiaofeng Wu, Jun Xie, Wenbin Jiang, Qun Liu,
and Shouxun Lin. 2014. Red: A reference depen-
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. dency based mt evaluation metric. In COLING’14,
1999. Support Vector Learning for Ordinal Regres- pages 2042–2051.
sion. In In International Conference on Artificial
Neural Networks, pages 97–102. Hao Zhang and Daniel Gildea. 2007. Factorization of
synchronous context-free grammars in linear time.
Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito In In NAACL Workshop on Syntax and Structure in
Sudoh, and Hajime Tsukada. 2010. Automatic Statistical Translation (SSST.
Evaluation of Translation Quality for Distant Lan-
guage Pairs. In Proceedings of the 2010 Conference
on Empirical Methods in Natural Language Pro-
cessing, EMNLP ’10, pages 944–952, Stroudsburg,
PA, USA. Association for Computational Linguis-
tics.
Matous Machacek and Ondrej Bojar. 2014. Results
of the wmt14 metrics shared task. In Proceedings
of the Ninth Workshop on Statistical Machine Trans-
lation, pages 293–301, Baltimore, Maryland, USA,
June. Association for Computational Linguistics.
Michael Denkowski and Alon Lavie. 2014. Meteor
Universal: Language Specific Translation Evalua-
tion for Any Target Language. In Proceedings of the
ACL 2014 Workshop on Statistical Machine Transla-
tion.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: A method for automatic
evaluation of machine translation. In Proceedings
of the 40th Annual Meeting on Association for Com-
putational Linguistics, ACL ’02, pages 311–318,
Stroudsburg, PA, USA. Association for Computa-
tional Linguistics.
Maja Popović and Hermann Ney. 2009. Syntax-
oriented evaluation measures for machine transla-
tion output. In Proceedings of the Fourth Work-
shop on Statistical Machine Translation, StatMT
’09, pages 29–32, Stroudsburg, PA, USA. Associ-
ation for Computational Linguistics.

401

Page 50 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

E Particle Swarm Optimization Submission for WMT16 Tuning Task

Particle Swarm Optimization Submission for WMT16 Tuning Task

Viktor Kocur Ondřej Bojar


CTU in Prague Charles University in Prague
FNSPE MFF ÚFAL
kocurvik@fjfi.cvut.cz bojar@ufal.mff.cuni.cz

Abstract translation metric2 . The standard optimization is a


variant of grid search and in our work, we replace
This paper describes our submission to the it with the Particle Swarm Optimization (PSO,
Tuning Task of WMT16. We replace the Eberhart et al., 1995) algorithm.
grid search implemented as part of stan- Particle Swarm Optimization is a good candi-
dard minimum-error rate training (MERT) date for an efficient implementation of the inner
in the Moses toolkit with a search based loop of MERT due to the nature of the optimiza-
on particle swarm optimization (PSO). An tion space. The so-called Traditional PSO (TPSO)
older variant of PSO has been previously has already been tested by Suzuki et al. (2011),
successfully applied and we now test it with a success. Improved versions of the PSO al-
in optimizing the Tuning Task model for gorithm, known as Standard PSO (SPSO), have
English-to-Czech translation. We also been summarized in Clerc (2012).
adapt the method in some aspects to al- In this paper, we test a modified version of
low for even easier parallelization of the the latest SPSO2011 algorithm within the Moses
search. toolkit and compare its results and computational
costs with the standard Moses implementation of
1 Introduction MERT.
Common models of statistical machine transla- 2 MERT
tion (SMT) consist of multiple features which as-
sign probabilities or scores to possible transla- The basic goal of MERT is to find optimal weights
tions. These are then combined in a weighted for various numerical features of an SMT system.
sum to determine the best translation given by the The weights are considered optimal if they min-
model. Tuning within SMT refers to the process of imize an automated error metric which compares
finding the optimal weights for these features on a the machine translation to a human translation for
given tuning set. This paper describes our submis- a certain tuning (development) set.
sion to WMT16 Tuning Task1 , a shared task where Formally, each feature provides a score (some-
all the SMT model components and the tuning set times a probability) that a given sentence e in goal
are given and task participants are expected to pro- language is the translation of the foreign sentence
vide only the weight settings. We took part only in f . Given a weight for each such feature, it is pos-
English-to-Czech system tuning. sible to combine the scores to a single figure and
Our solution is based on the standard tuning find the highest scoring translation. The best trans-
method of Minimum Error-Rate Training (MERT, lation can then be obtained by the following for-
Och, 2003). The MERT algorithm described in mula:
Bertoldi et al. (2009) is the default tuning method
in the Moses SMT toolkit (Koehn et al., 2007). X
The inner loop of the algorithm performs opti- e∗ = argmax λi log (pi (e|f )) = gp (λ) (1)
mization on a space of weight vectors with a given e
i

1 2
http://www.statmt.org/wmt16/ All our experiments optimize the default BLEU but other
tuning-task/ metrics could be directly tested as well.

515
Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 515–521,
Berlin, Germany, August 11-12, 2016. c 2016 Association for Computational Linguistics

Page 51 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

The process of finding the best translation e∗ is model has 21 features but adding sparse features,
called decoding. The translations can vary signif- we can get to thousands of dimensions.
icantly based on the values of the weights, there- These properties of the search space make PSO
fore it is necessary to find the weights that would an interesting candidate for the inner loop algo-
give the best result. This is achieved by minimiz- rithm. PSO is stochastic so it doesn’t require
ing the error of the machine translation against the smoothness of the optimized function. It is also
human translation: highly parallelizable and gains more power with
more CPUs available, which is welcome since the
λ∗ = argmin errf (gp (λ), ehuman ) (2) optimization itself is quite expensive. The simplic-
λ
ity of PSO also leaves space for various improve-
The error function can also be considered as a ments.
negative value of an automated scorer. The prob-
lem with this straight-forward approach is that de-
coding is computationally expensive. To reduce 3 PSO Algorithm
this cost, the decoder is not run for every consid-
ered weight setting. Instead, only some promis- The PSO algorithm was first described by Eber-
ing settings are tested in a loop (called the “outer hart et al. (1995). PSO is an iterative optimization
loop”): given the current best weights, the decoder method inspired by the behavior of groups of ani-
is asked to produce n best translation for each mals such as flocks of birds or schools of fish. The
sentence of the tuning set. This enlarged set of space is searched by individual particles with their
candidates allows us to estimate translation scores own positions and velocities. The particles can in-
for similar weight settings. An optimizer uses form others of their current and previous positions
these estimates to propose a new vector of weights and their properties.
and the decoder then tests this proposal in another
outer loop. The outer loop is stopped when no new
weight setting is proposed by the optimizer or no 3.1 TPSO
new translations are found by the decoder. The
The original algorithm is defined quite generally.
run of the optimizer is called the “inner loop”, al-
Let us formally introduce the procedure. The
though it need not be iterative in any sense. The
search space S is defined as
optimizer tries to find the best weights so that the
least erroneous translations appear as high as pos-
sible in the n-best lists of candidate translations. D
O
Our algorithm replaces the inner loop of MERT. S= [mind , maxd ] (3)
It is therefore important to describe the properties d=1
of the inner loop optimization task.
Due to finite number of translations accumu- where D is the dimension of the space and
lated in the n-best lists (across sentences as well as mind and maxd are the minimal and maximal
outer loop iterations), the error function changes values for the d-th coordinate. We try to find a
only when the change in weights leads to a change point in the space which maximizes a given func-
in the order of the n-best list. This is represented tion f : S 7→ R.
by numerous plateaus in the error function with
There are p particles and the i-th particle in
discontinuities on the edges of the plateaus. This
the n-th iteration has the following D-dimensional
prevents the use of simple gradient methods. We
vectors: position xni , velocity vin , and two vectors
can define a local optimum not in a strict math-
of maxima found so far: the best position pni vis-
ematical sense but as a plateau which has only
ited by the particle itself and the best known po-
higher or only lower plateaus at the edges. These
sition lni that the particle has learned about from
local optima can then be numerous within the
others.
search space and trap any optimizing algorithm,
thus preventing convergence to the global opti- In TPSO algorithm, the lni vector is always the
mum which is desired. globally best position visited by any particle so far.
Another problem is the relatively high dimen- The TPSO algorithm starts with simple initial-
sionality of the search space. The Tuning Task ization:

516

Page 52 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

located in any close vicinity). This set of best po-


sitions is limited to k elements, each new addition
x0i = rand(S) (4) over the limit k replaces the oldest information.
rand(S) − x0i To establish the “global” optimum lti , every parti-
vi0 = (5) cle consults only its set of learned best positions.
2
p0i = x0i (6) The algorithm starts with the initialization of
particle vectors given by the equations (4-6). The
l0i = argmax f (p0j ) (7)
j l0i is initialized with the value of p0i . The sets of
learned best positions are initialized as empty.
where the function rand(S) generates a random Two constants affect computations given below:
vector from space S with uniform distribution. w is again the slowdown and c controls the “ex-
The velocity for the next iteration is updated as pansion” of examined neighbourhood of each par-
follows: ticle. We set w and c to values that (as per Bonyadi
and Michalewicz, 2014) ensure convergence:

vit+1 = wvit + U (0, 1)φp pti + U (0, 1)φl lti (8) 1


w= ≈ 0.721 (12)
2ln(2)
where U (0, 1) denotes a random number be-
tween 0 and 1 with uniform distribution. The pa- 1
rameters w, φp , φl ∈ (0, 1) are set by the user and c= + ln(2) ≈ 1.193 (13)
2
indicate a slowdown, and the respective weight for
own vs. learned optimum.
All the following vectors are then updated:
clti
xt+1
i = xti + vit+1 (9)
lti
pt+1 = xt+1 if f (xt+1 t
i ) > f (pi ) (10) yit
i i wvit
lt+1
i = argmax(f (pt+1
j )) (11)
j xt+1
i

The process continues with the next iteration Gti


until all of the particles converge to proximity of a vit+1
cpti
certain point. Other stopping criteria are also used. pti
vit
3.2 Modified SPSO2011
xti
We introduce a number of changes to the algo-
rithm SPSO2011 described by Clerc (2012). Figure 1: Construction of the particle position up-
In SPSO2011 the global best position lti is re- date. The grey area indicates P (G, x).
placed by the best position the particle has re-
For the update of velocity, it is first necessary to
ceived information about from other particles. In
calculate a “center of gravity” Gti of three points:
the original SPSO2011 this is done in a synchro-
the current position xti , a slightly “expanded” cur-
nized fashion: after every iteration, all particles
rent best position pti and a slightly expanded best
send their best personal positions to m other parti-
position known by colleagues lti . The “expansion”
cles. Every particle chooses the best position it has
of the positions is controlled by c and directed out-
received in the current iteration and sets its lti ac-
wards from xti :
cordingly. This generalization of lti is introduced
in order to combat premature convergence to a lo-
pti + lti − 2xti
cal optimum. Gti = xti + c · (14)
3
To avoid waiting until all particles finish their
computation, we introduce per-particle memory To introduce further randomness, xti is relocated
of “learned best positions” called the “neighbour- to a position yit sampled from the uniform distri-
hood set” (although its members do not have to be bution in the area P (Gti , xti ) formally defined as:

517

Page 53 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

the algorithm for a fixed number of position up-


dates, specifically 32000. Later, we changed the
D h
O i
algorithm to terminate after the manager has seen
P (G, x) = Gd − |Gd − xd |, Gd + |Gd − xd |
3200 position updates without any update of the
d=1
(15) global best position. In the following section, we
Our P (G, x) is a hypercube centered in Gti and refer to the former as PSO without the termination
touching xti , see Figure 1 for an illustration. The condition (PSO) and the latter as PSO with the ter-
original SPSO2011 used a d-dimensional ball with mination condition (PSO-T).
the center in G and radius kG − xk to avoid the Properties of SPSO2011 have been investigated
bias of searching towards points on axes. We are by Bonyadi and Michalewicz (2014). We use a
less concerned about this and opt for a simpler and slightly different algorithm, but our modifications
faster calculation. should have an effect only on rotational invariance,
The new velocity is set to include the previous which is not so much relevant for our purpose.
velocity (reduced by w) as well as the speedup Aside from the discussion on the values of w and
caused by the random relocation: c with respect to the convergence of all particles
to the same point, Bonyadi and Michalewicz also
vit+1 = wvit + yit − xti (16) mention that SPSO2011 is not guaranteed to con-
verge to a local optimum. Since our search space
Finally, the particle position is updated: is discontinuous with plateaus, the local conver-
gence in the mathematical sense is not especially
xt+1
i = xti + vit+1 = wvit + yit (17) useful anyway.

The optimized function is evaluated at the new 4 Implementation


position xt+1
i and the particle’s best position is up-
dated if a new optimum was found. In any case, We implemented the algorithm described above
the best position pt+1
i together with its value is with one parameter, the number of particles. We
sent to m randomly selected particles (possibly in- set the size of the neighborhood set, denoted k
cluding the current particle) to be included in their above, to 4 and the number of random particles re-
sets of learned best positions as described above. ceiving the information about a particle’s best po-
The particle then sets its lt+1
i to best position from sition so far (m) to 3.
its own list of learned positions. The implementation of our version of the PSO
The next iteration continues with the updated algorithm is built within the standard Moses code.
vectors. Normally, the algorithm would terminate The algorithm itself creates a reasonable parallel
when all particles converge to a close proximity structure with each thread representing a single
to each other, but it turns out that this often leads particle.
to premature stopping. There are many other ap- We use similar object structure as the base-
proaches possible to this problem (Xinchao, 2010; line MERT implementation. The points are rep-
Evers and Ben Ghalia, 2009), but we choose a sim- resented by their own class which handles basic
ple restarting strategy: when the particle is send- arithmetic and stream operations. The class car-
ing out its new best position and value to m fel- ries not only the vector of the current position but
lows, the manager responsible for this checks if also its associated score.
this value was not reported in the previous call Multiple threads are maintained by the stan-
(from any other particle). If it was, then the current dard Moses thread pools (Haddow, 2012). Ev-
particle is instructed to restart itself by setting all ery thread (“Task” in Moses thread pools) cor-
of its vectors to random initial state.3 The neigh- responds to a particle and is responsible for cal-
borhood set is left unchanged. The restart prevents culating its search in the space using the class
multiple particles exploring the same area. PSOOptimizer. There are no synchronous it-
The drawback of restarts is that the stopping cri- erations, each particle proceeds at its own pace.
terion is never met. In our first version, we ran All optimizers have access to a global manager
3
object of class PSOManager, see Figure 2 for an
The use of score and not position is possible due to the
nature of the space in which a same score of two points very illustration. The manager provides methods for
likely means that the points are equivalent. the optimizers to get the best vector lti from the

518

Page 54 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

Run PSO-16 PSO-64 PSO-T-16 PSO-T-64 MERT-16


1 14.5474 15.6897 15.6133 15.6613 14.5470
2 17.3292 18.7340 18.7437 18.4464 18.8704
3 18.9261 18.9788 18.9711 18.9069 19.0625
4 19.0926 19.2060 19.0646 19.0785 19.0623
5 19.1599 19.2140 19.0968 19.0738 19.1992
6 19.2444 19.2319 - 19.0772 19.1751
7 19.2470 19.2383 - - 19.0480
8 19.2613 19.2245 - - 19.1359
12 - - - - 19.1625

Table 1: The final best BLEU score after the runs of the inner loop for PSO without and with the
termination condition with 16 and 64 threads respectively and standard Moses MERT implementation
with 16 threads.

pared to the calculations performed in the optimiz-


AllTasks
ers. The only locking occurs when threads are try-
PSOOptimizationTask PSOOptimizationTask
ing to add points; read access to the manager can
...
PSOOptimizer PSOOptimizer be concurrent.
FeatureData FeatureData
5 Results
ScorerData ScorerData

We ran the tuning only for the English to Czech


part of the tuning task. We filtered and binarized
the model supplied by the organizers to achieve
PSOManager
+addPoint(Point p)
better performance and smaller memory costs.
+getBestNeighbor(int i, Point P) For the computation, we used the services of
+cont()
Metacentrum VO. Due to the relatively high mem-
ory demands we used two SGI UV 2000 machines:
Figure 2: Base structure of our PSO algorithm one with 48x 6-core Intel Xeon E5-4617 2.9GHz
and 6TB RAM and one with 48x 8-core Intel Xeon
E5-4627v2 3.30GHz and 6TB RAM. We ran the
neighborhood set, to report its best position to the tuning process on 16 and 64 CPUs, i.e. with 16
random m particles (addPoint) and to check if and 64 particles, respectively. We submitted the
the optimization should still run (cont) or termi- weights from the 16-CPU run. We also ran a test
nate. The method addPoint serves two other run using the standard Moses MERT implementa-
purposes: incrementing an internal counter of it- tion with 16 threads for a comparison.
erations and indicating through its return value Table 1 shows the best BLEU scores at the end
whether the reporting particle should restart itself. of each inner loop (as projected from the n-best
Every optimizer has its own FeatureData lists on the tuning set of sentences). Both meth-
and ScorerData, which are used to determine ods provide similar results. Since the methods are
the score of the investigated points. As of now, stochastic, different runs will lead to different best
the data is loaded serially, so the more threads we positions (and different scores).
have, the longer the initialization takes. In the Comparison of our implementation with with
baseline implementation of MERT, all the threads the baseline MERT on a test set is not nec-
share the scoring data. This means that the data essary. Both implementations try to maximize
is loaded only once, but due to some unexpected BLEU score, therefore any overtraining occurring
locking, the baseline implementation never gains in the baseline MERT occurs also in our imple-
speedups higher than 1.5, even with 32 threads, mentation and vice versa.
see Table 2 below. Table 2 shows the average run times and
This structure allows an efficient use of multi- reached scores for 8 runs of the baseline MERT
ple cores. Methods of the manager are fast com- and our PSO and PSO-T, starting with the same

519

Page 55 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

Wall Clock [s] Projected BLEU Reached


Outer Loop CPUs MERT PSO PSO-T MERT PSO PSO-T
1 1 186.24±10.63 397.28±2.13 62.37±19.64 14.50±0.03 13.90±0.05 13.84±0.05
1 4 123.51±3.58 72.75±1.12 21.94±4.63 14.51±0.03 14.48±0.08 14.46±0.06
1 8 135.40±8.43 43.07±0.78 15.62±3.40 14.52±0.04 14.53±0.05 14.42±0.12
1 16 139.43±8.00 33.00±1.37 14.59±2.21 14.53±0.02 14.51±0.08 14.48±0.10
1 24 119.69±4.43 32.20±1.62 16.89±3.16 14.52±0.02 14.55±0.06 14.47±0.07
1 32 119.04±4.47 33.42±2.16 19.16±2.92 14.53±0.03 14.50±0.04 14.50±0.07
3 1 701.18±47.13 1062.38±1.88 117.64±0.47 18.93±0.04 18.08±0.00 18.08±0.00
3 4 373.69±28.37 189.86±0.64 57.28±23.61 18.90±0.00 18.82±0.12 18.81±0.07
3 8 430.88±24.82 111.50±0.53 37.92±8.68 18.95±0.05 18.89±0.09 18.87±0.06
3 16 462.77±18.78 80.54±5.39 29.62±4.34 18.94±0.04 18.94±0.07 18.90±0.05
3 24 392.66±13.39 74.08±3.64 31.67±3.47 18.94±0.04 18.93±0.05 18.86±0.05
3 32 399.93±27.68 82.83±3.82 37.70±4.52 18.91±0.01 18.90±0.05 18.87±0.06

Table 2: Average run times and reached scores. The ± are standard deviations.

n-best lists as accumulated in iteration 1 and 3 of ing language resources stored and distributed
the outer loop. Note that PSO and PSO-T use only by the LINDAT/CLARIN project of the Min-
as many particles as there are threads, so running istry of Education, Youth and Sports of the
them with just one thread leads to a degraded per- Czech Republic (project LM2015071). Compu-
formace in terms of BLEU. With 4 or 8 threads, tational resources were supplied by the Ministry
the three methods are on par in terms of tuning- of Education, Youth and Sports of the Czech
set BLEU. Starting from 4 threads, both PSO and Republic under the Projects CESNET (Project
PSO-T terminate faster than the baseline MERT No. LM2015042) and CERIT-Scientific Cloud
implementation. Moreover the baseline MERT (Project No. LM2015085) provided within the
proved unable to utilize multiple CPUs efficiently, program Projects of Large Research, Development
whereas PSO gives us up to 14-fold speedup. and Innovations Infrastructures.
In general, the higher the ratio of the serial data
loading to the search computation time, the worse References
the speedup. The search in PSO-T takes much Nicola Bertoldi, Barry Haddow, and Jean-Baptiste
shorter time so the overhead of serial data loading Fouet. 2009. Improved minimum error rate
is more apparent and PSO-T seems parallelized training in moses. The Prague Bulletin of Math-
badly and gives only quadruple speedup. The re- ematical Linguistics 91:7–16.
duction of this overhead is highly desirable. Mohammad Reza Bonyadi and Zbigniew
Michalewicz. 2014. Spso 2011: Analysis of
6 Conclusion
stability; local convergence; and rotation sensi-
We presented our submission to the WMT16 Tun- tivity. In Proceedings of the 2014 conference on
ing Task, a variant of particle swarm optimization Genetic and evolutionary computation. ACM,
applied to minimum error-rate training in statisti- pages 9–16.
cal machine translation. Our method is a drop-in Maurice Clerc. 2012. Standard particle swarm op-
replacement of the standard Moses MERT and has timisation .
the benefit of easy parallelization. Preliminary ex- Russ C Eberhart, James Kennedy, et al. 1995.
periments suggest that it indeed runs faster and de- A new optimizer using particle swarm theory.
livers comparable weight settings. In Proceedings of the sixth international sym-
The effects on the number of iterations of the posium on micro machine and human science.
MERT outer loop as well as on the test-set perfor- New York, NY, volume 1, pages 39–43.
mance have still to be investigated.
George I Evers and Mounir Ben Ghalia. 2009. Re-
Acknowledgments grouping particle swarm optimization: a new
global optimization algorithm with improved
This work has received funding from the Eu- performance consistency across benchmarks. In
ropean Union’s Horizon 2020 research and in- Systems, Man and Cybernetics, 2009. SMC
novation programme under grant agreement no. 2009. IEEE International Conference on. IEEE,
645452 (QT21). This work has been us- pages 3901–3908.

520

Page 56 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

Barry Haddow. 2012. Adding Multi-Threaded De-


coding to Moses. Prague Bulletin of Mathemat-
ical Linguistics 93:57–66.
Philipp Koehn, Hieu Hoang, Alexandra Birch,
Chris Callison-Burch, Marcello Federico,
Nicola Bertoldi, Brooke Cowan, Wade Shen,
Christine Moran, Richard Zens, et al. 2007.
Moses: Open source toolkit for statistical
machine translation. In Proceedings of the
45th annual meeting of the ACL on interactive
poster and demonstration sessions. Association
for Computational Linguistics, pages 177–180.
Franz Josef Och. 2003. Minimum error rate train-
ing in statistical machine translation. In Pro-
ceedings of the 41st Annual Meeting on Asso-
ciation for Computational Linguistics-Volume
1. Association for Computational Linguistics,
pages 160–167.
Jun Suzuki, Kevin Duh, and Masaaki Nagata.
2011. Distributed minimum error rate training
of smt using particle swarm optimization. In
IJCNLP. pages 649–657.
Zhao Xinchao. 2010. A perturbed particle swarm
algorithm for numerical optimization. Applied
Soft Computing 10(1):119–124.

521

Page 57 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

F CharacTER: Translation Edit Rate on Character Level

G Exponentially Decaying Bag-of-Words Input Features for Feed-


Forward Neural Network in Statistical Machine Translation

Exponentially Decaying Bag-of-Words Input Features for Feed-Forward


Neural Network in Statistical Machine Translation

Jan-Thorsten Peter, Weiyue Wang, Hermann Ney


Human Language Technology and Pattern Recognition, Computer Science Department
RWTH Aachen University, 52056 Aachen, Germany
{peter,wwang,ney}@cs.rwth-aachen.de

Abstract context length on source and target sides. Using


the Bag-of-Words (BoW) model as additional in-
Recently, neural network models have put of a neural network based language model,
achieved consistent improvements in sta- (Mikolov et al., 2015) have achieved very simi-
tistical machine translation. However, lar perplexities on automatic speech recognition
most networks only use one-hot encoded tasks in comparison to the long short-term mem-
input vectors of words as their input. ory (LSTM) neural network, whose structure is
In this work, we investigated the ex- much more complex. This suggests that the bag-
ponentially decaying bag-of-words input of-words model can effectively store the longer
features for feed-forward neural network term contextual information, which could show
translation models and proposed to train improvements in statistical machine translation as
the decay rates along with other weight pa- well. Since the bag-of-words representation can
rameters. This novel bag-of-words model cover as many contextual words without further
improved our phrase-based state-of-the-art modifying the network structure, the problem of
system, which already includes a neural limited context window size of feed-forward neu-
network translation model, by up to 0.5% ral networks is reduced. Instead of predefining
B LEU and 0.6% T ER on three different fixed decay rates for the exponentially decaying
translation tasks and even achieved a simi- bag-of-words models, we propose to learn the de-
lar performance to the bidirectional LSTM cay rates from the training data like other weight
translation model. parameters in the neural network model.

2 The Bag-of-Words Input Features


1 Introduction
The bag-of-words model is a simplifying repre-
Neural network models have recently gained much sentation applied in natural language processing.
attention in research on statistical machine trans- In this model, each sentence is represented as the
lation. Several groups have reported strong im- set of its words disregarding the word order. Bag-
provements over state-of-the-art baselines when of-words models are used as additional input fea-
combining phrase-based translation with feed- tures to feed-forward neural networks in addition
forward neural network-based models (FFNN) to the one-hot encoding. Thus, the probability of
(Schwenk et al., 2006; Vaswani et al., 2013; the feed-forward neural network translation model
Schwenk, 2012; Devlin et al., 2014), as well with an m-word source window can be written as:
as with recurrent neural network models (RNN)
I
(Sundermeyer et al., 2014). Even in alternative Y
p(eI1 | f1J ) ≈ p(ei | fbbii−∆
+∆m
m
, fBoW,i ) (1)
translation systems they showed remarkable per- i=1
formance (Sutskever et al., 2014; Bahdanau et al.,
2015). where ∆m = m−1 2 and bi is the index of the single
The main drawback of a feed-forward neural aligned source word to the target word ei . We ap-
network model compared to a recurrent neural plied the affiliation technique proposed in (Devlin
network model is that it can only have a limited et al., 2014) for obtaining the one-to-one align-

Page 58 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

Page 59 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

ments. The bag-of-words input features fBoW,i can contextual words from the current word. There-
be seen as normalized n-of-N vectors as demon- fore the bag-of-words vector with decay weights
strated in Figure 1, where n is the number of words can be defined as following:
inside each bag-of-words. X
f˜BoW,i = d|i−k| f˜k (2)
k∈SBoW

where
[0 1 0 0 · · · 0 0] [0 0 0 1 · · · 0 0] [0 0 1 0 · · · 0 0] [ n1 1 1 1 1
0 0 0 ··· n n] [0 0 n 0 ··· n 0]
i, k Positions of the current word and words
original word features bag-of-words input features
within the BoW model respectively.
Figure 1: The bag-of-words input features along f˜BoW,i The value vector of the BoW input fea-
with the original word features. The input vectors ture for the i-th word in the sentence.
are projected and concatenated at the projection
layer. We omit the hidden and output layers for f˜k One-hot encoded feature vector of the k-
simplification, since they remain unchanged. th word in the sentence.

SBoW Indices set of the words contained in the


2.1 Contents of Bag-of-Words Features BoW. If a word appears more than once
in the BoW, the index of the nearest one
Before utilizing the bag-of-words input features
to the current word will be selected.
we have to decide which words should be part of
it. We tested multiple different variants: d Decay rate with float value ranging from
zero to one. It specifies how fast weights
1. Collecting all words of the sentence in one bag- of contextual words decay along with dis-
of-words except the currently aligned word. tances, which can be learned like other
weight parameters of the neural network.
2. Collecting all preceding words in one bag-of-
words and all succeeding words in a second Instead of using fixed decay rate as in (Irie et al.,
bag-of-words. 2015), we propose to train the decay rate like other
weight parameters in the neural network. The ap-
3. Collecting all preceding words in one bag-of- proach presented by (Mikolov et al., 2015) is com-
words and all succeeding words in a second parable to the corpus decay rate shown here, ex-
bag-of-words except those already included in cept that their work makes use of a diagonal ma-
the source window. trix instead of a scalar as decay rate. In our ex-
periments, three different kinds of decay rates are
All of these variants provide the feed-forward
trained and applied:
neural network with an unlimited context in both
directions. The differences between these setups 1. Corpus decay rate: all words in vocabulary
only varied by 0.2% B LEU and 0.1% T ER. We share the same decay rate.
choose to base further experiments on the last vari-
ant since it performed best and seemed to be the 2. Individual decay rate for each bag-of-words:
most logical choice for us. each bag-of-words has its own decay rate given
the aligned word.
2.2 Exponentially Decaying Bag-of-Words
3. Individual decay rate for each word: each word
Another variant is to weight the words within uses its own decay rate.
the bag-of-words model. In the standard bag-
of-words representation these weights are equally We use the English sentence
distributed for all words. This means the bag-of- “friends had been talking about this fish for a long time”
words input is a vector which marks if a word is as an example to clarify the differences between
given or not and does not encode the word or- these variants. A five words contextual window
der. To avoid this problem, the exponential decay centered at the current aligned word fish has
approach proposed in (Clarkson and Robinson, been applied: {about, this, fish, for, a}.
1997) has been adopted to express the distance of The bag-of-words models are used to collect all

Page 60 of 78
Quality Translation 21
D1.5: Improved Learning for Machine Translation

3 Experiments

3.1 Setup

Experiments are conducted on the IWSLT 2013 German→English, WMT 2015 German→English and DARPA BOLT Chinese→English translation tasks. GIZA++ (Och and Ney, 2003) is applied for aligning the parallel corpus. The translation quality is evaluated by the case-insensitive BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) metrics. The scaling factors are tuned with MERT (Och, 2003) with BLEU as optimization criterion on the development sets. The systems are evaluated using MultEval (Clark et al., 2011). In the experiments the maximum size of the n-best lists applied for reranking is 500. For the translation experiments, the averaged scores over three optimization runs are presented on the development set. Experiments are performed using the Jane toolkit (Vilar et al., 2010; Wuebker et al., 2012) with a log-linear framework containing the following feature functions:

• Phrase translation probabilities in both directions
• Word lexicon features in both directions
• Enhanced low frequency counts (Chen et al., 2011)
• 4-gram language model
• 7-gram word class language model (Wuebker et al., 2013)
• Word and phrase penalties
• Hierarchical reordering model (Galley and Manning, 2008)

Additionally, a neural network translation model, similar to (Devlin et al., 2014), with the following configuration is applied for reranking the n-best lists:

• Projection layer size 100 for each word
• Two non-linear hidden layers with 1000 and 500 nodes respectively
• Short-list size 10000 along with 1000 word classes at the output layer
• 5 one-hot input vectors of words

Unless otherwise stated, the investigations on bag-of-words input features are based on this neural network model. We also integrated our neural network translation model into the decoder as proposed in (Devlin et al., 2014). The relative improvements provided by integrated decoding and reranking are quite similar, which can also be confirmed by (Alkhouli et al., 2015). We therefore decided to use only reranking for repeated experimentation.
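To make the reranking step concrete, the following sketch (hypothetical function names; not the Jane toolkit's code) rescores an n-best list by adding the neural network translation model as one more weighted feature in the log-linear sum, with the weights playing the role of the MERT-tuned scaling factors:

```python
# Log-linear n-best reranking with an additional neural network score (sketch).
def rerank(nbest, weights, nn_weight, nn_score):
    """nbest: list of (hypothesis, {feature_name: score}) pairs.
    nn_score: callable returning the NN translation model log-probability."""
    def total(hyp, feats):
        s = sum(weights[name] * value for name, value in feats.items())
        return s + nn_weight * nn_score(hyp)
    return max(nbest, key=lambda item: total(*item))

# Toy usage with made-up scores:
nbest = [("hyp one", {"tm": -4.2, "lm": -9.1}), ("hyp two", {"tm": -3.9, "lm": -9.6})]
best = rerank(nbest, weights={"tm": 1.0, "lm": 0.6}, nn_weight=0.8,
              nn_score=lambda h: -len(h.split()))
print(best[0])
```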

3.2 Exponentially Decaying Bag-of-Words

As shown in Section 2.2, the exponential decay approach is applied to express the distance of contextual words from the current word. Thereby the information of sequence order can be included into bag-of-words models. We demonstrated three different kinds of decay rates for words in the bag-of-words input feature, namely the corpus general decay rate, the bag-of-words individual decay rate and the word individual decay rate.

                   IWSLT test      IWSLT eval11    WMT newstest2013  BOLT test
                   BLEU[%] TER[%]  BLEU[%] TER[%]  BLEU[%] TER[%]    BLEU[%] TER[%]
Baseline + NNTM    31.9    47.5    36.7    43.0    28.8    53.8      17.4    67.1
 + BoW Features    32.0    47.3    36.9    42.9    28.8    53.5∗     17.5    67.0
 + Fixed DR (0.9)  32.2∗   47.3    37.0∗   42.6∗†  29.0    53.5∗     17.7∗   66.8∗
 + Corpus DR       32.1    47.3    36.9    42.7∗   29.1∗†  53.5∗     17.7∗   66.7∗†
 + BoW DR          32.4∗†  47.0∗†  37.2∗†  42.4∗†  29.2∗†  53.2∗†    17.9∗†  66.6∗†
 + Word DR         32.3∗†  47.0∗   37.1∗   42.7∗   29.1∗†  53.4∗     17.8∗†  66.7∗†
Baseline + LSTM    32.2∗   47.4    37.1∗   42.5∗†  29.0    53.3∗     17.6    66.8∗

Table 1: Experimental results of translations using exponentially decaying bag-of-words models with different kinds of decay rates. Improvements by systems marked with ∗ are statistically significant over the baseline system at the 95% level, whereas † denotes statistically significant improvements at the 95% level with respect to the BoW Features system (without decay weights). We experimented with several values for the fixed decay rate (DR) and 0.9 performed best. The applied RNN model is the LSTM bidirectional translation model proposed in (Sundermeyer et al., 2014).

Table 1 illustrates the experimental results of the neural network translation model with exponentially decaying bag-of-words input features on the IWSLT 2013 German→English, WMT 2015 German→English and BOLT Chinese→English translation tasks. Here we applied two bag-of-words models to separately contain the preceding and succeeding words outside the context window. We can see that the bag-of-words feature without exponential decay weights only provides small improvements. After appending the decay weights, the four different kinds of decay rates provide further improvements to varying degrees. The bag-of-words individual decay rate performs the best, which gives us improvements by up to 0.5% on BLEU and up to 0.6% on TER. On these tasks, these improvements even help the feed-forward neural network achieve a similar performance to the popular long short-term memory recurrent neural network model (Sundermeyer et al., 2014), which contains three LSTM layers with 200 nodes each. The results of the word individual decay rate are worse than those of the bag-of-words decay rate. One reason is that in the word individual case, the sequence order can still be missing. We initialize all values for the tunable decay rates with 0.9. In the IWSLT 2013 German→English task, the corpus decay rate is tuned to 0.578. When investigating the values of the trained bag-of-words individual decay rate vector, we noticed that the variance of the value for frequent words is much lower than for rare words. We also observed that most function words, such as prepositions and conjunctions, are assigned low decay rates. We could not find a pattern for the trained value vector of the word individual decay rates.

3.3 Comparison between Bag-of-Words and Large Context Window

The main motivation behind the usage of the bag-of-words input features is to provide the model with additional context information. We compared the bag-of-words input features to different source side windows to refute the argument that simply increasing the size of the window could achieve the same results. Our experiments showed that increasing the source side window beyond 11 gave no further improvements, while the model that used the bag-of-words input features was able to achieve the best result (Figure 2). A possible explanation for this could be that the feed-forward neural network learns its input in a position-dependent way. If one source word is moved by one position, the feed-forward neural network needs to have seen a word with a similar word vector at this position during training to interpret it correctly. The likelihood of precisely getting the position decreases with a larger distance. The bag-of-words model, on the other hand, will still get the same input, only slightly stronger or weaker depending on the new distance and the decay rate.

[Figure 2: BLEU scores on eval (36.5–37.2) vs. context window size (|input vectors|, 3–21); a horizontal line marks the model with 5 words + 2 BoWs.]

Figure 2: The change of BLEU scores on the eval11 set of the IWSLT 2013 German→English task along with the source context window size. The source windows are always symmetrical with respect to the aligned word. For instance, window size five denotes that two preceding and two succeeding words along with the aligned word are included in the window. The average sentence length of the corpus is about 18 words. The red line is the result of using a model with bag-of-words input features and a bag-of-words individual decay rate.
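The position-dependence argument can be illustrated with a small sketch of the two input representations (toy vocabulary and window size, not the experimental setup): the fixed window encodes each word at a specific slot, whereas the bag-of-words vector only records membership and is therefore insensitive to small positional shifts:

```python
# Sketch of the two input representations compared in Section 3.3 (illustrative only).
VOCAB = {"<pad>": 0, "friends": 1, "had": 2, "been": 3, "talking": 4,
         "about": 5, "this": 6, "fish": 7, "for": 8, "a": 9, "long": 10, "time": 11}

def window_input(sentence, center, size):
    """Fixed window: one index per position, so every word is position-dependent."""
    half = size // 2
    padded = ["<pad>"] * half + sentence + ["<pad>"] * half
    return [VOCAB[w] for w in padded[center:center + size]]

def bow_input(sentence, center, size):
    """Bag-of-words over all words outside the window: an order-free indicator vector."""
    half = size // 2
    outside = sentence[:max(center - half, 0)] + sentence[center + half + 1:]
    vec = [0] * len(VOCAB)
    for w in outside:
        vec[VOCAB[w]] = 1
    return vec

sent = "friends had been talking about this fish for a long time".split()
print(window_input(sent, sent.index("fish"), 5))   # shifts whenever a word moves
print(bow_input(sent, sent.index("fish"), 5))      # order-free membership vector
```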

4 Conclusion

The aim of this work was to investigate the influence of exponentially decaying bag-of-words input features with trained decay rates on the feed-forward neural network translation model. Applying the standard bag-of-words model as an additional input feature in our feed-forward neural network translation model only yields slight improvements, since the original bag-of-words representation does not include information about the ordering of each word. To avoid this problem, we applied the exponential decay weight to express the distances between words and propose to train the decay rate like the other weight parameters of the network. Three different kinds of decay rates are proposed; the bag-of-words individual decay rate performs best and provides improvements of 0.5% BLEU on average on three different translation tasks, which is even able to outperform a bidirectional LSTM translation model on the given tasks. By contrast, applying additional one-hot encoded input vectors or enlarging the network structure cannot achieve such good performance as the bag-of-words features.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21).

References

Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. Investigations on phrase-based decoding with recurrent neural network language and translation models. In EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015), pages 294–303, Lisbon, Portugal, September.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, May.

Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. 2011. Unpacking and Transforming Feature Functions: New Ways to Smooth Phrase Tables. In Proceedings of MT Summit XIII, pages 269–275, Xiamen, China, September.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Short Papers, pages 176–181, Portland, OR, USA, June.

P. Clarkson and A. Robinson. 1997. Language Model Adaptation Using Mixtures and an Exponentially Decaying Cache. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 799–802, Washington, DC, USA, April.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1370–1380, Baltimore, MD, USA, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 848–856, Honolulu, HI, USA, October.

Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2015. Bag-of-Words Input for Long History Representation in Neural Network-based Language Models for Speech Recognition. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, pages 2371–2375, Dresden, Germany, September.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc'Aurelio Ranzato. 2015. Learning longer memory in recurrent neural networks. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, May.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19–51, March.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Philadelphia, PA, USA, July.

Holger Schwenk, Daniel Déchelotte, and Jean-Luc Gauvain. 2006. Continuous space language models for statistical machine translation. In Proceedings of the 44th Annual Meeting of the International Committee on Computational Linguistics and the Association for Computational Linguistics, pages 723–730, Sydney, Australia, July.

Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Proceedings of the 24th International Conference on Computational Linguistics, pages 1071–1080, Mumbai, India, December.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Conference of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, MA, USA, August.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014. Translation modeling with bidirectional recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 14–25, Doha, Qatar, October.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, WA, USA, October.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 262–270, Uppsala, Sweden, July.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation. In International Conference on Computational Linguistics, pages 483–491, Mumbai, India, December.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving Statistical Machine Translation with Word Class Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1377–1381, Seattle, WA, USA, October.

H A Comparative Study on Vocabulary Reduction for Phrase Table Smoothing

A Comparative Study on Vocabulary Reduction for Phrase Table Smoothing

Yunsu Kim, Andreas Guta, Joern Wuebker∗, and Hermann Ney

Human Language Technology and Pattern Recognition Group
RWTH Aachen University, Aachen, Germany
{surname}@cs.rwth-aachen.de

∗Lilt, Inc.
joern@lilt.com

Abstract

This work systematically analyzes the smoothing effect of vocabulary reduction for phrase translation models. We extensively compare various word-level vocabularies to show that the performance of smoothing is not significantly affected by the choice of vocabulary. This result provides empirical evidence that the standard phrase translation model is extremely sparse. Our experiments also reveal that vocabulary reduction is more effective for smoothing large-scale phrase tables.

1 Introduction

Phrase-based systems for statistical machine translation (SMT) (Zens et al., 2002; Koehn et al., 2003) have shown state-of-the-art performance over the last decade. However, due to the huge size of the phrase vocabulary, it is difficult to collect robust statistics for lots of phrase pairs. The standard phrase translation model thus tends to be sparse (Koehn, 2010).

A fundamental solution to a sparsity problem in natural language processing is to reduce the vocabulary size. By mapping words onto a smaller label space, the models can be trained to have denser distributions (Brown et al., 1992; Miller et al., 2004; Koo et al., 2008). Examples of such labels are part-of-speech (POS) tags or lemmas.

In this work, we investigate vocabulary reduction for phrase translation models with respect to various vocabulary choices. We evaluate two types of smoothing models for the phrase translation probability using different kinds of word-level labels. In particular, we use automatically generated word classes (Brown et al., 1992) to obtain label vocabularies with arbitrary sizes and structures. Our experiments reveal that the vocabulary of the smoothing model has no significant effect on the end-to-end translation quality. For example, a randomized label space also leads to a decent improvement of BLEU or TER scores by the presented smoothing models.

We also test vocabulary reduction in translation scenarios of different scales, showing that the smoothing works better with more parallel corpora.

2 Related Work

Koehn and Hoang (2007) propose integrating a label vocabulary as a factor into the phrase-based SMT pipeline, which consists of the following three steps: mapping from words to labels, label-to-label translation, and generation of words from labels. Rishøj and Søgaard (2011) verify the effectiveness of word classes as factors. Assuming probabilistic mappings between words and labels, the factorization implies a combinatorial expansion of the phrase table with regard to different vocabularies.

Wuebker et al. (2013) show a simplified case of the factored translation by adopting hard assignment from words to labels. In the end, they train the existing translation, language, and reordering models on word classes to build the corresponding smoothing models.

Other types of features are also trained on word-level labels, e.g. hierarchical reordering features (Cherry, 2013), an n-gram-based translation model (Durrani et al., 2014), and sparse word pair features (Haddow et al., 2015). The first and the third are trained with a large-scale discriminative training algorithm.

For all usages of word-level labels in SMT, a common and important question is which label vocabulary maximizes the translation quality. Bisazza and Monz (2014) compare class-based language models with diverse kinds of labels in terms of their performance in translation into morphologically rich languages. To the best of our knowledge, there is no published work on systematic comparison between different label vocabularies, model forms, and training data size for smoothing phrase translation models—the most basic component in state-of-the-art SMT systems. Our work fulfills these needs with extensive translation experiments (Section 5) and quantitative analysis (Section 6) in a standard phrase-based SMT framework.

3 Word Classes

In this work, we mainly use unsupervised word classes by Brown et al. (1992) as the reduced vocabulary. This section briefly reviews the principle and properties of word classes.

A word-class mapping c is estimated by a clustering algorithm that maximizes the following objective (Brown et al., 1992):

    L := Σ_{e_1^I} Σ_{i=1}^{I} p(c(e_i) | c(e_{i-1})) · p(e_i | c(e_i))    (1)

for a given monolingual corpus {e_1^I}, where each e_1^I is a sentence of length I in the corpus. The objective guides c to prefer certain collocations of class sequences, e.g. an auxiliary verb class should succeed a class of pronouns or person names. Consequently, the resulting c groups words according to their syntactic or semantic similarity.

Word classes have a big advantage for our comparative study: The structure and size of the class vocabulary can be arbitrarily adjusted by the clustering parameters. This makes it possible to easily prepare an abundant set of label vocabularies that differ in linguistic coherence and degree of generalization.

4 Smoothing Models

In the standard phrase translation model, the translation probability for each segmented phrase pair (f̃, ẽ) is estimated by relative frequencies:

    p_std(f̃ | ẽ) = N(f̃, ẽ) / N(ẽ)    (2)

where N is the count of a phrase or a phrase pair in the training data. These counts are very low for many phrases due to a limited amount of bilingual training data.

Using a smaller vocabulary, we can aggregate the low counts and make the distribution smoother. We now define two types of smoothing models for Equation 2 using a general word-label mapping c.

4.1 Mapping All Words at Once (map-all)

For the phrase translation model, the simplest formulation of vocabulary reduction is obtained by replacing all words in the source and target phrases with the corresponding labels in a smaller space. Namely, we employ the following probability instead of Equation 2:

    p_all(f̃ | ẽ) = N(c(f̃), c(ẽ)) / N(c(ẽ))    (3)

which we call map-all. This model resembles the word class translation model of Wuebker et al. (2013) except that we allow any kind of word-level labels.

This model generalizes all words of a phrase without distinction between them. Also, the same formulation is applied to word-based lexicon models.

4.2 Mapping Each Word at a Time (map-each)

More elaborate smoothing can be achieved by generalizing only a sub-part of the phrase pair. The idea is to replace one source word at a time with its respective label. For each source position j, we also replace the target words aligned to the source word f_j. For this purpose, we let a_j ⊆ {1, ..., |ẽ|} denote the set of target positions aligned to j. The resulting model takes a weighted average of the redefined translation probabilities over all source positions of f̃:

    p_each(f̃ | ẽ) = Σ_{j=1}^{|f̃|} w_j · N(c^(j)(f̃), c^(a_j)(ẽ)) / N(c^(a_j)(ẽ))    (4)

where the superscripts of c indicate the positions that are mapped onto the label space. w_j is a weight for each source position, with Σ_j w_j = 1. We call this model map-each.
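The two smoothing models can be pictured with a small count-based sketch (data structures and helper names are illustrative; the counts of partially generalized phrase pairs are assumed to have been collected from the training data beforehand):

```python
# Sketch of the map-all and map-each probabilities (Equations 3-5). Phrases are
# tuples of words, `labels` maps word -> label, `align` maps each source position j
# to the set of aligned target positions, and the count tables are plain dicts or
# collections.Counter objects over (partially) generalized phrase pairs.

def map_labels(phrase, positions, labels):
    """Replace only the words at the given positions with their labels."""
    return tuple(labels[w] if i in positions else w for i, w in enumerate(phrase))

def p_all(f, e, counts_pair, counts_tgt, labels):
    cf = map_labels(f, range(len(f)), labels)
    ce = map_labels(e, range(len(e)), labels)
    return counts_pair[(cf, ce)] / counts_tgt[ce]

def p_each(f, e, align, counts_pair, counts_tgt, labels):
    # One partially generalized pair per source position j (Equation 4).
    gen = []
    for j in range(len(f)):
        cf = map_labels(f, {j}, labels)
        ce = map_labels(e, align.get(j, set()), labels)
        gen.append((cf, ce))
    # Equation 5: weight each position by the normalized count of its generalized pair.
    numerators = [counts_pair[pair] for pair in gen]
    z = sum(numerators) or 1.0
    return sum((n / z) * (n / counts_tgt[ce])
               for n, (cf, ce) in zip(numerators, gen) if counts_tgt[ce] > 0)
```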

[Figure 1: in-phrase word alignments of f̃ = [f1, f2, f3] and ẽ = [e1, e2, e3]: f1 aligned to e1, f2 unaligned, f3 aligned to e2 and e3.]

Figure 1: Word alignments of a pair of three-word phrases.

We illustrate this model with a pair of three-word phrases: f̃ = [f1, f2, f3] and ẽ = [e1, e2, e3] (see Figure 1 for the in-phrase word alignments). The map-each model score for this phrase pair is:

    p_each([f1, f2, f3] | [e1, e2, e3]) =
          w1 · N([c(f1), f2, f3], [c(e1), e2, e3]) / N([c(e1), e2, e3])
        + w2 · N([f1, c(f2), f3], [e1, e2, e3]) / N([e1, e2, e3])
        + w3 · N([f1, f2, c(f3)], [e1, c(e2), c(e3)]) / N([e1, c(e2), c(e3)])

where the alignments are depicted by line segments in Figure 1. First of all, we replace f1 and also e1, which is aligned to f1, with their corresponding labels. As f2 has no alignment points, we do not replace any target word accordingly. f3 triggers the class replacement of two target words at the same time. Note that the model implicitly encapsulates the alignment information.

We empirically found that the map-each model performs best with the following weight:

    w_j = N(c^(j)(f̃), c^(a_j)(ẽ)) / Σ_{j'=1}^{|f̃|} N(c^(j')(f̃), c^(a_{j'})(ẽ))    (5)

which is a normalized count of the generalized phrase pair itself. Here, the count is relatively large when f_j, the word to be backed off, is less frequent than the other words in f̃. In contrast, if f_j is a very frequent word and one of the other words in f̃ is rare, the count becomes low due to that rare word. The same logic holds for target words in ẽ. After all, Equation 5 carries more weight when a rare word is replaced with its label. The intuition is that a rare word is the main reason for unstable counts and should be backed off above all. We use this weight for all experiments in the next section.

In contrast, the map-all model merely replaces all words at one time and ignores the alignments within phrase pairs.

5 Experiments

5.1 Setup

We evaluate how much the translation quality is improved by the smoothing models in Section 4. The two smoothing models are trained in both source-to-target and target-to-source directions, and integrated as additional features in the log-linear combination of a standard phrase-based SMT system (Koehn et al., 2003). We also test linear interpolation between the standard and smoothing models, but the results are generally worse than log-linear interpolation. Note that vocabulary reduction models by themselves cannot replace the corresponding standard models, since this leads to a considerable drop in translation quality (Wuebker et al., 2013).

Our baseline systems include phrase translation models in both directions, word-based lexicon models in both directions, word/phrase penalties, a distortion penalty, a hierarchical lexicalized reordering model (Galley and Manning, 2008), a 4-gram language model, and a 7-gram word class language model (Wuebker et al., 2013). The model weights are trained with minimum error rate training (Och, 2003). All experiments are conducted with the open source phrase-based SMT toolkit Jane 2 (Wuebker et al., 2012).

To validate our experimental results, we measure the statistical significance using the paired bootstrap resampling method of Koehn (2004). Every result in this section is marked with ‡ if it is statistically significantly better than the baseline with 95% confidence, or with † for 90% confidence.
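For completeness, a generic sketch of paired bootstrap resampling (Koehn, 2004) as used for these significance marks; the corpus-level `metric` function and all names are placeholders, not the evaluation scripts actually used:

```python
import random

def paired_bootstrap(sys_a, sys_b, refs, metric, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.
    `metric` is a corpus-level scorer (e.g. BLEU, or 1-TER) where higher is better."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]   # sample sentences with replacement
        a = metric([sys_a[i] for i in idx], [refs[i] for i in idx])
        b = metric([sys_b[i] for i in idx], [refs[i] for i in idx])
        if a > b:
            wins += 1
    return wins / samples

# A is considered significantly better than B at the 95% level if the returned
# fraction is at least 0.95.
```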

5.2 Comparison of Vocabularies

The presented smoothing models are dependent on the label vocabulary, which is defined by the word-label mapping c. Here, we train the models with various label vocabularies and compare their smoothing performance.

The experiments are done on the IWSLT 2012 German→English shared translation task. To rapidly perform repetitive experiments, we train the translation models with the in-domain TED portion of the dataset (roughly 2.5M running words for each side). We run the monolingual word clustering algorithm of (Botros et al., 2015) on each side of the parallel training data to obtain class label vocabularies (Section 3).

We carry out comparative experiments regarding the three factors of the clustering algorithm:

1) Clustering iterations. It is shown that the number of iterations is the most influential factor in clustering quality (Och, 1995). We now verify its effect on translation quality when the clustering is used for phrase table smoothing.

As we run the clustering algorithm, we extract an intermediate class mapping for each iteration and train the smoothing models with it. The model weights are tuned for each iteration separately. The BLEU scores of the tuned systems are given in Figure 2. We use 100 classes on both source and target sides.

[Figure 2: BLEU [%] (28.2–29.4) over clustering iterations (0–35) for map-each, map-all, and the baseline, with individually tuned weights.]

Figure 2: BLEU scores for clustering iterations when using individually tuned model weights for each iteration. Dots indicate those iterations in which the translation is performed.

The score does not consistently increase or decrease over the iterations; it is rather on a similar level (±0.2% BLEU) for all settings with slight fluctuations. This is an important clue that the whole process of word clustering has no meaning in smoothing phrase translation models.

To see this more clearly, we keep the model weights fixed over different systems and run the same set of experiments. In this way, we focus only on the change of label vocabulary, removing the impact of nondeterministic model weight optimization. The results are given in Figure 3.

[Figure 3: BLEU [%] (28.2–29.4) over clustering iterations (0–35) for map-each, map-all, and the baseline, with a fixed set of weights; significance marks annotate individual points.]

Figure 3: BLEU scores for clustering iterations when using a fixed set of model weights. The weights that produce the best results in Figure 2 are chosen.

This time, the curves are even flatter, resulting in only ±0.1% BLEU difference over the iterations. More surprisingly, the models trained with the initial clustering, i.e. when the clustering algorithm has not even started yet, are on a par with those trained with more optimized classes in terms of translation quality.

2) Initialization of the clustering. Since the clustering process has no significant impact on the translation quality, we hypothesize that the initialization may dominate the clustering. We compare five different initial class mappings:

• random: randomly assign words to classes
• top-frequent (default): top-frequent words have their own classes, while all other words are in the last class
• same-countsum: each class has almost the same sum of word unigram counts
• same-#words: each class has almost the same number of words
• count-bins: each class represents a bin of the total count range

              Initialization   BLEU [%]  TER [%]
Baseline                       28.3      52.2
 + map-each   random           28.9‡     51.7‡
              top-frequent     29.0‡     51.5‡
              same-countsum    28.8‡     51.7‡
              same-#words      28.9‡     51.6‡
              count-bins       29.0‡     51.4‡

Table 1: Translation results for various initializations of the clustering. 100 classes on both sides.

Table 1 shows the translation results with the map-each model trained with these initializations—without running the clustering algorithm. We use the same set of model weights used in Figure 3. We find that the initialization method also does not affect the translation performance. As an extreme case, random clustering is also a fine candidate for training the map-each model.

3) Number of classes. This determines the vocabulary size of the label space, which eventually adjusts the smoothing degree. Table 2 shows the translation performance of the map-each model with a varying number of classes. As before, there is no serious performance gap among different word classes, and POS tags and lemmas also conform to this trend.

However, we observe a slight but steady degradation of translation quality (≈ -0.2% BLEU) when the vocabulary size is larger than a few hundred. We also lose statistical significance for BLEU in these cases. The reason could be: if the label space becomes larger, it gets closer to the original vocabulary and therefore the smoothing model provides less additional information to add to the standard phrase translation model.

                          #vocab (source)  BLEU [%]  TER [%]
Baseline                                   28.3      52.2
 + map-each (word class)  100              29.0‡     51.5‡
                          200              28.9†     51.6‡
                          500              28.7      51.8‡
                          1000             28.7      51.8‡
                          10000            28.7      51.9†
 + map-each (POS)         52               28.9†     51.5‡
 + map-each (lemma)       26744            28.8      51.7‡

Table 2: Translation results for different vocabulary sizes.

The series of experiments shows that the map-each model performs very similarly across vocabulary sizes and structures. From our internal experiments, this argument also holds for the map-all model. The results do not change even when we use a different clustering algorithm, e.g. bilingual clustering (Och, 1999). For the translation performance, the more important factor is the log-linear model training to find an optimal set of weights for the smoothing models.

5.3 Comparison of Smoothing Models

Next, we compare the two smoothing models by their performance in four different translation tasks: IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech. We train 100 classes on each side with 30 clustering iterations starting from the default (top-frequent) initialization.

Table 3 provides the corpus statistics of all datasets used. Note that a morphologically rich language is on the source side for the first two tasks, and on the target side for the last two tasks. According to the results (Table 4), the map-each model, which encourages backing off infrequent words, performs consistently better (maximum +0.5% BLEU, -0.6% TER) than the map-all model in all cases.

5.4 Comparison of Training Data Size

Lastly, we analyze the smoothing performance for different training data sizes (Figure 4). The improvement of the BLEU score over the baseline decreases drastically when the training data get smaller. We argue that this is because the smoothing models only provide additional scores for the phrases seen in the training data. For smaller training data, we have more out-of-vocabulary (OOV) words in the test set, which cannot be handled by the presented models.

[Figure 4: BLEU [%] (9–16) of map-each and the baseline, and OOV rate [%] (15–35), over the amount of training data (running words, 0–25M).]

Figure 4: BLEU scores and OOV rates for the varying training data portion of WMT 2015 Finnish→English data.
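The OOV rate plotted in Figure 4 is a simple corpus statistic; a minimal sketch (illustrative names, tokenized input assumed):

```python
def oov_rate(train_sentences, test_sentences):
    """Percentage of test-side running words that never occur in the training data."""
    train_vocab = {w for sent in train_sentences for w in sent}
    test_tokens = [w for sent in test_sentences for w in sent]
    oov = sum(1 for w in test_tokens if w not in train_vocab)
    return 100.0 * oov / len(test_tokens)

# Shrinking the training portion shrinks train_vocab and thus raises the OOV rate,
# which is why the smoothing gains in Figure 4 diminish for small training data.
```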

                IWSLT 2012         WMT 2015           WMT 2014            WMT 2015
                German   English   Finnish   English  English   German    English  Czech
Sentences             130k               1.1M               4M                 0.9M
Running Words   2.5M     2.5M      23M       32M      104M      105M      23.9M    21M
Vocabulary      71k      49k       509k      88k      648k      659k      161k     345k

Table 3: Bilingual training data statistics for the IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech tasks.

              de-en             fi-en             en-de             en-cs
              BLEU[%]  TER[%]   BLEU[%]  TER[%]   BLEU[%]  TER[%]   BLEU[%]  TER[%]
Baseline      28.3     52.2     15.1     72.6     14.6     69.8     15.3     68.7
 + map-all    28.6‡    51.6‡    15.3‡    72.5     14.8‡    69.4‡    15.4‡    68.2‡
 + map-each   29.0‡    51.4‡    15.8‡    72.0‡    15.1‡    69.0‡    15.8‡    67.6‡

Table 4: Translation results for the IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech tasks.

                                       Top 200 TER-improved Sentences
Model      Classes         #vocab   Common Input [%]   Same Translation [%]
map-each   optimized       100      -                  -
           non-optimized   100      89.5               89.9
           random          100      88.5               89.8
           lemma           26744    87.0               92.6
map-all    optimized       100      56.0               54.5

Table 5: Comparison of translation outputs for the smoothing models with different vocabularies. “optimized” denotes 30 iterations of the clustering algorithm, whereas “non-optimized” means the initial (default) clustering.

6 Analysis

In Section 5.2, we have shown experimentally that more optimized or more fine-grained classes do not guarantee better smoothing performance. We now verify by examining translation outputs that the same level of performance is not by chance but due to similar hypothesis scoring across different systems.

Given a test set, we compare its translations generated from different systems as follows. First, for each translated set, we sort the sentences by how much the sentence-level TER is improved over the baseline translation. Then, we select the top 200 sentences from this sorted list, which represent the main contribution to the decrease of TER. In Table 5, we compare the top 200 TER-improved translations of the map-each model setups with different vocabularies.

In the fourth column, we trace the input sentences that are translated by the top 200 lists, and count how many of those inputs overlap across the given systems. Here, a large overlap indicates that two systems are particularly effective in a large common part of the test set, showing that they behaved analogously in the search process. The numbers in this column are computed against the map-each model setup trained with 100 optimized word classes (first row). For all map-each settings, the overlap is very large—around 90%.

To investigate further, we count how often the two translations of a single input are identical (the last column). This is normalized by the number of common input sentences in the top 200 lists between the two systems. It is a straightforward measure to see if two systems discriminate translation hypotheses in a similar manner. Remarkably, all systems equipped with the map-each model produce exactly the same translations for the most part of the top 200 TER-improved sentences.
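The comparison just described can be sketched as follows (hypothetical data layout: each system is given as a list of sentence ids, hypotheses, and sentence-level TER improvements over the baseline):

```python
def top_improved(system, k=200):
    """Top-k sentences by TER improvement; system: list of (sid, hypothesis, ter_gain)."""
    ranked = sorted(system, key=lambda x: x[2], reverse=True)[:k]
    return {sid: hyp for sid, hyp, _ in ranked}

def compare(system_a, system_b, k=200):
    """Common-input overlap and identical-translation rate between two systems,
    as reported in the last two columns of Table 5."""
    top_a, top_b = top_improved(system_a, k), top_improved(system_b, k)
    common = set(top_a) & set(top_b)
    same = sum(1 for sid in common if top_a[sid] == top_b[sid])
    return 100.0 * len(common) / k, 100.0 * same / max(len(common), 1)
```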

We can see from this analysis that, even though a smoothing model is trained with essentially different vocabularies, it helps the translation process in basically the same manner. For comparison, we also compute the measures for a map-all model, which are far behind the high similarity among the map-each models. Indeed, for smoothing phrase translation models, changing the model structure for vocabulary reduction exerts a strong influence on the hypothesis scoring, yet changing the vocabulary does not.

7 Conclusion

Reducing the vocabulary using a word-label mapping is a simple and effective way of smoothing phrase translation models. By mapping each word in a phrase at a time, the translation quality can be improved by up to +0.7% BLEU and -0.8% TER over a standard phrase-based SMT baseline, which is superior to Wuebker et al. (2013).

Our extensive comparison among various vocabularies shows that different word-label mappings are almost equally effective for smoothing phrase translation models. This allows us to use any type of word-level label, e.g. a randomized vocabulary, for the smoothing, which saves a considerable amount of effort in optimizing the structure and granularity of the label vocabulary. Our analysis on sentence-level TER demonstrates that the same level of performance stems from the analogous hypothesis scoring.

We claim that this result emphasizes the fundamental sparsity of the standard phrase translation model. Too many target phrase candidates are originally undervalued, so giving them any reasonable amount of extra probability mass, e.g. by smoothing with random classes, is enough to broaden the search space and improve translation quality. Even if we change a single parameter in estimating the label space, it does not have a significant effect on scoring hypotheses, where many other models than the smoothed translation model, e.g. language models, are involved with large weights. Nevertheless, an exact linguistic explanation is still to be discovered.

Our results on varying training data show that vocabulary reduction is more suitable for large-scale translation setups. This implies that OOV handling is more crucial than smoothing phrase translation models for low-resource translation tasks.

For future work, we plan to perform a similar set of comparative experiments on neural machine translation systems.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21).

References

Arianna Bisazza and Christof Monz. 2014. Class-based language modeling for translating into morphologically rich languages. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1918–1927, Dublin, Ireland, August.

Rami Botros, Kazuki Irie, Martin Sundermeyer, and Hermann Ney. 2015. On efficient training of word classes and their application to recurrent neural network language models. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), pages 1443–1447, Dresden, Germany, September.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December.

Colin Cherry. 2013. Improved reordering for phrase-based translation using sparse features. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), pages 22–31, Atlanta, GA, USA, June.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the usefulness of generalized word representations in SMT. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 421–432, Dublin, Ireland, August.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 848–856, Honolulu, HI, USA, October.

Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU phrase-based machine translation systems for WMT 2015. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015), pages 126–133, Lisbon, Portugal, September.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 868–876, Prague, Czech Republic, June.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT 2003), pages 48–54, Edmonton, Canada, May.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 388–395, Barcelona, Spain, July.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), pages 595–603, Columbus, OH, USA, June.

Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name tagging with word clusters and discriminative training. In Proceedings of the 2004 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2004), pages 337–342, Boston, MA, USA, May.

Franz Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany, May.

Franz Josef Och. 1999. An efficient method for determining bilingual word classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL 1999), pages 71–76, Bergen, Norway, June.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 160–167, Sapporo, Japan, July.

Christian Rishøj and Anders Søgaard. 2011. Factored translation with unsupervised word clusters. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation (WMT 2011), pages 447–451, Edinburgh, Scotland, July.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open source phrase-based and hierarchical statistical machine translation. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 483–492, Mumbai, India, December.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving statistical machine translation with word class models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1377–1381, Seattle, USA, October.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-based statistical machine translation. In Matthias Jarke, Jana Koehler, and Gerhard Lakemeyer, editors, 25th German Conference on Artificial Intelligence (KI2002), volume 2479 of Lecture Notes in Artificial Intelligence (LNAI), pages 18–32, Aachen, Germany, September. Springer Verlag.

I BIRA: Improved Predictive Exchange Word Clustering

BIRA: Improved Predictive Exchange Word Clustering

Jon Dehdari¹,² and Liling Tan² and Josef van Genabith¹,²

¹ DFKI, Saarbrücken, Germany
{jon.dehdari,josef.van_genabith}@dfki.de
² University of Saarland, Saarbrücken, Germany
liling.tan@uni-saarland.de

Abstract

Word clusters are useful for many NLP tasks including training neural network language models, but current increases in datasets are outpacing the ability of word clusterers to handle them. Little attention has been paid thus far to inducing high-quality word clusters at a large scale. The predictive exchange algorithm is quite scalable, but sometimes does not provide as good perplexity as other slower clustering algorithms.

We introduce the bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. It improves upon the predictive exchange algorithm's perplexity by up to 18%, giving it perplexities comparable to the slower two-sided exchange algorithm, and better perplexities than the slower Brown clustering algorithm. Our BIRA implementation is fast, clustering a 2.5 billion token English News Crawl corpus in 3 hours. It also reduces machine translation training time while preserving translation quality. Our implementation is portable and freely available.

1 Introduction

Words can be grouped together into equivalence classes to help reduce data sparsity and better generalize data. Word clusters are useful in many NLP applications. Within machine translation word classes are used in word alignment (Brown et al., 1993; Och and Ney, 2000), translation models (Koehn and Hoang, 2007; Wuebker et al., 2013), reordering (Cherry, 2013), preordering (Stymne, 2012), target-side inflection (Chahuneau et al., 2013), SAMT (Zollmann and Vogel, 2011), and OSM (Durrani et al., 2014), among many others.

Word clusterings have also found utility in parsing (Koo et al., 2008; Candito and Seddah, 2010; Kong et al., 2014), chunking (Turian et al., 2010), NER (Miller et al., 2004; Liang, 2005; Ratinov and Roth, 2009; Ritter et al., 2011), structure transfer (Täckström et al., 2012), and discourse relation discovery (Rutherford and Xue, 2014).

Word clusters also speed up normalization in training neural network and MaxEnt language models, via class-based decomposition (Goodman, 2001a). This reduces the normalization time from O(|V|) (the vocabulary size) to ≈ O(√|V|). More improvements, to O(log(|V|)), are found using hierarchical softmax (Morin and Bengio, 2005; Mnih and Hinton, 2009).

2 Word Clustering

Word clustering partitions a vocabulary V, grouping together words that function similarly. This helps generalize language and alleviate data sparsity. We discuss flat clustering in this paper. Flat, or strict partitioning, clustering surjectively maps word types onto a smaller set of clusters.

The exchange algorithm (Kneser and Ney, 1993) is an efficient technique that exhibits a general time complexity of O(|V| × |C| × I), where |V| is the number of word types, |C| is the number of classes, and I is the number of training iterations, typically < 20. This omits the specific method of exchanging words, which adds further complexity. Words are exchanged from one class to another until convergence or I iterations.
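As a rough illustration of the exchange algorithm described above, the following naive sketch recomputes the class-bigram likelihood for every candidate move (toy data; real implementations such as mkcls update the affected counts incrementally rather than recomputing them, which is where the stated complexities come from):

```python
from collections import Counter
from math import log
import random

def class_ll(corpus, assign):
    """Log-likelihood of the two-sided class bigram model:
    sum_i log p(c_i | c_{i-1}) + log p(w_i | c_i), with relative-frequency estimates."""
    bi, cls, wrd, pred = Counter(), Counter(), Counter(), Counter()
    for sent in corpus:
        prev = "<s>"
        for w in sent:
            c = assign[w]
            bi[(prev, c)] += 1
            pred[prev] += 1
            cls[c] += 1
            wrd[w] += 1
            prev = c
    ll = sum(n * log(n / cls[assign[w]]) for w, n in wrd.items())   # p(w|c) terms
    ll += sum(n * log(n / pred[p]) for (p, c), n in bi.items())     # p(c|c') terms
    return ll

def exchange(corpus, num_classes, iters=3, seed=0):
    """Naive exchange clustering: try every class for every word and keep the move
    that maximizes the likelihood."""
    vocab = sorted({w for s in corpus for w in s})
    rng = random.Random(seed)
    assign = {w: rng.randrange(num_classes) for w in vocab}
    for _ in range(iters):
        for w in vocab:
            scores = []
            for c in range(num_classes):
                assign[w] = c
                scores.append((class_ll(corpus, assign), c))
            assign[w] = max(scores)[1]
    return assign

toy = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
print(exchange(toy, num_classes=2))
```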

One of the oldest and still most popular exchange algorithm implementations is mkcls (Och, 1995)¹, which adds various metaheuristics to escape local optima. Botros et al. (2015) introduce their implementation of three exchange-based algorithms. Martin et al. (1998) and Müller and Schütze (2015)² use trigrams within the exchange algorithm. Clark (2003) adds an orthotactic bias.³

The previous algorithms use an unlexicalized (two-sided) language model: P(w_i|w_{i-1}) = P(w_i|c_i) P(c_i|c_{i-1}), where the class c_i of the predicted word w_i is conditioned on the class c_{i-1} of the previous word w_{i-1}. Goodman (2001b) altered this model so that c_i is conditioned directly upon w_{i-1}, hence: P(w_i|w_{i-1}) = P(w_i|c_i) P(c_i|w_{i-1}). This new model fractionates the history more, but it allows for a large speedup in hypothesizing an exchange since the history doesn't change. The resulting partially lexicalized (one-sided) class model gives the accompanying predictive exchange algorithm (Goodman, 2001b; Uszkoreit and Brants, 2008) a time complexity of O((B + |V|) × |C| × I), where B is the number of unique bigrams in the training set.⁴ We introduce a set of improvements to this algorithm to enable high-quality large-scale word clusters.

3 BIRA Predictive Exchange

We developed a bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. The goal of BIRA is to produce better clusters by using multiple, changing models to escape local optima. This uses both forward and reversed bigram class models to improve cluster quality by evaluating log-likelihood on two different models. Unlike using trigrams, bidirectional bigram models only linearly increase time and memory requirements, and in fact some data structures can be shared. The two directions are interpolated to allow softer integration of these two models:

    P(w_i | w_{i-1}, w_{i+1}) ≜ P(w_i | c_i) · (λ P(c_i | w_{i-1}) + (1 − λ) P(c_i | w_{i+1}))    (1)

The interpolation weight λ for the forward direction alternates to 1 − λ every a iterations (i):

    λ_i := 1 − λ_0  if i mod a = 0;  λ_0 otherwise    (2)

Figure 1 illustrates the benefit of this λ-inversion to help escape local minima, with lower training set perplexity by inverting λ every four iterations.

[Figure 1: training set perplexity (roughly 650–800) over iterations for the predictive exchange algorithm with and without λ-inversion (+Rev).]

Figure 1: Training set perplexity using lambda inversion (+Rev), using 100M tokens of the Russian News Crawl (cf. §4.1). Here a = 4, λ_0 = 1, and |C| = 800.

The time complexity is O(2 × (B + |V|) × |C| × I). The original predictive exchange algorithm can be obtained by setting λ = 1 and a = 0.⁵

Another innovation, both in terms of cluster quality and speed, is cluster refinement. The vocabulary is initially clustered into |G| sets, where |G| ≪ |C|, typically 2–10. After a few iterations (i) of this, the full partitioning C_f is explored. Clustering G converges very quickly, typically requiring no more than 3 iterations.⁶

    |C|_i := |G|  if i ≤ 3;  |C|_f otherwise    (3)

The intuition behind this is to group words first into broad classes, like nouns, verbs, adjectives, etc. In contrast to divisive hierarchical clustering and coarse-to-fine methods (Petrov, 2009), after the initial iterations the algorithm is still able to exchange any word to any cluster—there is no hard constraint that the more refined partitions be subsets of the initial coarser partitions. This gives more flexibility in optimizing on log-likelihood, especially given the noise that naturally arises from coarser clusterings.

¹ https://github.com/moses-smt/mgiza
² http://cistern.cis.lmu.de/marlin
³ http://bit.ly/1VJwZ7n
⁴ Green et al. (2014) provide a Free implementation of the original predictive exchange algorithm within the Phrasal MT system, at http://nlp.stanford.edu/phrasal . Another implementation is in the Cicada semiring MT system.
⁵ The time complexity is O((B + |V|) × |C| × I) if λ = 1.
⁶ The piecewise definition could alternatively be conditioned upon a percentage threshold of moved words.
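A minimal sketch of Equations (1) and (2) (illustrative names; this is not the actual ClusterCat code): the bidirectional score interpolates the forward and reversed one-sided models, and the λ schedule flips the interpolation weight every a iterations:

```python
from collections import Counter
from math import log

class BiraScore:
    """Interpolated bidirectional one-sided class model of Equation (1); counts are
    collected once from a corpus for a fixed word->class mapping `assign`."""

    def __init__(self, corpus, assign):
        self.assign = assign
        self.n_w, self.n_c = Counter(), Counter()
        self.fwd, self.rev = Counter(), Counter()        # N(prev_word, c), N(next_word, c)
        self.n_prev, self.n_next = Counter(), Counter()
        for sent in corpus:
            for i, w in enumerate(sent):
                c = assign[w]
                self.n_w[w] += 1
                self.n_c[c] += 1
                if i > 0:
                    self.fwd[(sent[i - 1], c)] += 1
                    self.n_prev[sent[i - 1]] += 1
                if i + 1 < len(sent):
                    self.rev[(sent[i + 1], c)] += 1
                    self.n_next[sent[i + 1]] += 1

    def log_prob(self, prev_w, w, next_w, lam):
        """log [ P(w|c) * (lam*P(c|prev_w) + (1-lam)*P(c|next_w)) ]."""
        c = self.assign[w]
        p_w = self.n_w[w] / self.n_c[c]
        p_f = self.fwd[(prev_w, c)] / max(self.n_prev[prev_w], 1)
        p_r = self.rev[(next_w, c)] / max(self.n_next[next_w], 1)
        return log(p_w * (lam * p_f + (1 - lam) * p_r) + 1e-12)

def lam_at(i, lam0=1.0, a=4):
    """Lambda-inversion schedule of Equation (2); a = 0 recovers the original algorithm."""
    return (1.0 - lam0) if (a > 0 and i % a == 0) else lam0
```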

[Figure 2: development set perplexity (roughly 450–550) on the Russian News Crawl (T = 100M, |C| = 800) for the predictive exchange baseline and all combinations of the BIRA components +BiDi, +Refine, and +Rev.]

Figure 2: Development set PP of combinations of improvements to predictive exchange (cf. §3), using 100M tokens of the Russian News Crawl, with 800 word classes.

We explored cluster refinement over more stages than just two, successively increasing the number of clusters. We observed no improvement over the two-stage method described above.

Each BIRA component can be applied to any exchange-based clusterer. The contributions of each of these are shown in Figure 2, which reports the development set perplexities (PP) of all combinations of BIRA components over the original predictive exchange algorithm. The data and configurations are discussed in more detail in Section 4. The greatest PP reduction is due to using lambda inversion (+Rev), followed by cluster refinement (+Refine), then interpolating the bidirectional models (+BiDi), with robust improvements by using all three of these—an 18% reduction in perplexity over the predictive exchange algorithm. We have found that both lambda inversion and cluster refinement prevent early convergence at local optima, while bidirectional models give immediate and consistent training set PP improvements, but this is attenuated in a unidirectional evaluation.

We observed that most of the computation for the predictive exchange algorithm is spent on the logarithm function, calculating δ ← δ − N(w, c) · log N(w, c).⁷ Since the codomain of N(w, c) is ℕ₀, and due to the power law distribution of the algorithm's access to these entropy terms, we can precompute N · log N up to, say 10e+7, with minimal memory requirements.⁸ This results in a considerable speedup of around 40%.

4 Experiments

Our experiments consist of both intrinsic and extrinsic evaluations. The intrinsic evaluation measures the perplexity (PP) of two-sided class-based models for English and Russian, and the extrinsic evaluation measures BLEU scores of phrase-based MT of Russian↔English and Japanese↔English texts.

4.1 Class-based Language Model Evaluation

In this task we used 400, 800, and 1200 classes for English, and 800 classes for Russian. The data comes from the 2011–2013 News Crawl monolingual data of the WMT task.⁹ For these experiments the data was deduplicated, shuffled, tokenized, digit-conflated, and lowercased. In order to have a large test set, one line per 100 of the resulting (shuffled) corpus was separated into the test set.¹⁰ The minimum count threshold was set to 3 occurrences in the training set. Table 1 shows information on the resulting corpus.

Corpus          Tokens  Types  Lines
English Train   1B      2M     42M
English Test    12M     197K   489K
Russian Train   550M    2.7M   31M
Russian Test    6M      284K   313K

Table 1: Monolingual training & test set sizes.

The clusterings are evaluated on the PP of an external 5-gram unidirectional two-sided class-based language model (LM). The n-gram-order interpolation weights are tuned using a distinct development set of comparable size and quality as the test set. Table 2 and Figure 3 show perplexity results using a varying number of classes. Two-sided exchange gives the lowest perplexity across the board, although this is within a two-sided LM evaluation.

⁷ δ is the change in log-likelihood, and N(w, c) is the count of a given word followed by a given class.
⁸ This was independently discovered in Botros et al. (2015).
⁹ http://www.statmt.org/wmt15/translation-task.html
¹⁰ The data setup script is at http://www.dfki.de/~jode03/naacl2016.sh .
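Returning to the N · log N precomputation at the end of Section 3, a tiny sketch (the table size here is illustrative; the paper precomputes values up to roughly 10^7):

```python
from math import log

TABLE_SIZE = 1_000_000
N_LOG_N = [0.0] + [n * log(n) for n in range(1, TABLE_SIZE)]   # 0*log(0) treated as 0

def n_log_n(n):
    """N*log(N) via table lookup for frequent small counts, falling back to log()."""
    return N_LOG_N[n] if n < TABLE_SIZE else n * log(n)

# Inside the exchange loop, delta updates such as  delta -= N(w, c) * log N(w, c)
# then avoid most calls to log(), which otherwise dominate the runtime.
```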

English News Crawl, T = 109


Training Set mkcls B IRA Brown Phrasal
220 EN, |C| = 400 39.0 1.0 2.3 3.1
EN, |C| = 800 48.8 1.4 12.5 5.1
200 Clusterer
EN, |C| = 1200 68.8 1.7 25.5 6.2
RU, |C| = 800 75.0 1.5 14.6 5.5

● BIRA
Perplexity

Brown
180
Table 3: Clustering times (hours) of full training sets. Mkcls
2−Sided Exchange
implements two-sided exchange, and Phrasal implements one-
Pred. Exchange
160 ●
sided predictive exchange.

140 ●

400 600 800 1000 1200


The original predictive exchange algorithm has
Number of Classes
a more fractionated history than the two-sided
Figure 3: 5-gram two-sided class-based LM perplexities for
exchange algorithm. Interestingly, increasing the
various clusterers on English News Crawl varying the number
number of clusters causes a convergence in the
of classes.
In general Brown clusters give slightly worse results relative to exchange-based clusters, since Brown clustering requires an early, permanent placement of frequent words, with further restrictions imposed on the |C|-most frequent words (Liang, 2005).¹³ Liang-style Brown clustering is only efficient on a small number of clusters, since there is a |C|² term in its time complexity.

Training Set      mkcls   BIRA   Brown   Phrasal
EN, |C| = 400      39.0    1.0     2.3      3.1
EN, |C| = 800      48.8    1.4    12.5      5.1
EN, |C| = 1200     68.8    1.7    25.5      6.2
RU, |C| = 800      75.0    1.5    14.6      5.5

Table 3: Clustering times (hours) of full training sets. Mkcls implements two-sided exchange, and Phrasal implements one-sided predictive exchange.

The original predictive exchange algorithm has a more fractionated history than the two-sided exchange algorithm. Interestingly, increasing the number of clusters causes a convergence in the word clusterings themselves, while also causing a divergence in the time complexities of these two varieties of the exchange algorithm. The metaheuristic techniques employed by the two-sided clusterer mkcls can be applied to other exchange-based clusterers, including ours, for further improvements.
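To make the two varieties concrete: the one-sided (predictive) exchange algorithm scores a clustering by the class-dependent part of the log-likelihood of p(w_i | w_{i−1}) = p(c(w_i) | w_{i−1}) · p(w_i | c(w_i)), which reduces to Σ_{w,c} N(w, c) log N(w, c) − Σ_c N(c) log N(c) up to terms that do not depend on the clustering. The sketch below is deliberately naive, rescoring the whole clustering for every candidate move, which is exactly the cost that real implementations (mkcls, Phrasal, clustercat) avoid with incremental δ updates; the names are hypothetical:

import math
from collections import defaultdict

def clustering_score(bigrams, cluster):
    """Clustering-dependent part of the predictive class-bigram log-likelihood:
    sum_{w,c} N(w,c) log N(w,c) - sum_c N(c) log N(c), where N(w,c) counts word
    w followed by a word of class c and N(c) counts class c as a successor."""
    n_wc = defaultdict(int)
    n_c = defaultdict(int)
    for (w, v), n in bigrams.items():
        c = cluster[v]
        n_wc[(w, c)] += n
        n_c[c] += n
    return (sum(n * math.log(n) for n in n_wc.values())
            - sum(n * math.log(n) for n in n_c.values()))

def naive_exchange_pass(vocab, bigrams, cluster, num_classes):
    """One greedy pass: move each word to the class that maximizes the score.
    Deliberately unoptimized; real exchange clusterers update only the affected
    counts incrementally instead of recomputing everything per candidate."""
    for v in vocab:
        best_c, best_score = cluster[v], None
        for c in range(num_classes):
            cluster[v] = c
            score = clustering_score(bigrams, cluster)
            if best_score is None or score > best_score:
                best_c, best_score = c, score
        cluster[v] = best_c
    return cluster

The two-sided variety optimized by mkcls scores the analogous quantity for p(c(w_i) | c(w_{i−1})) · p(w_i | c(w_i)) instead, which involves class-class rather than word-class counts.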
Table 3 presents wall clock times using the full training set, varying the number of word classes |C| (for English).¹⁴ The predictive exchange-based clusterers (BIRA and Phrasal) exhibit slow increases in time as the number of classes increases, while the others (Brown and mkcls) are much more sensitive to |C|. Our BIRA-based clusterer is three times faster than Phrasal for all these sets.

We performed an additional experiment, adding more English News Crawl training data.¹⁵ Our implementation took 3.0 hours to cluster 2.5 billion training tokens with |C| = 800, using modest hardware.¹⁴

¹¹ Negative sampling & hierarchical softmax; CBOW & skip-gram; various window sizes; various dimensionalities.
¹² For the two-sided exchange we used mkcls; for the original pred. exchange we used Phrasal's clusterer; for Brown clustering we used Percy Liang's brown-cluster (329dc). All had min-count=3, and all but mkcls (which is not multithreaded) had threads=12, iterations=15.
¹³ Recent work by Derczynski and Chester (2016) loosens some restrictions on Brown clustering.
¹⁴ All time experiments used a 2.4 GHz Opteron 8378 featuring 16 threads.
¹⁵ Adding years 2008–2010 and 2014 to the existing training data. This training set was too large for the external class-based LM to fit into memory, so no perplexity evaluation of this clustering was possible.

4.2 Machine Translation Evaluation

We also evaluated the BIRA predictive exchange algorithm extrinsically in machine translation. As discussed in Section 1, word clusters are employed in a variety of ways within machine translation systems, the most common of which is in word alignment, where mkcls is widely used. As training sets get larger every year, mkcls struggles to keep pace, and is a substantial time bottleneck in MT pipelines with large datasets.

We used data from the Workshop on Machine Translation 2015 (WMT15) Russian↔English dataset and the Workshop on Asian Translation 2014 (WAT14) Japanese↔English dataset (Nakazawa et al., 2014). Both pairs used standard configurations, like truecasing, MeCab segmentation for Japanese, MGIZA alignment, grow-diag-final-and phrase extraction, phrase-based Moses, quantized KenLM 5-gram modified Kneser-Ney LMs, and MERT tuning.

|C|     EN-RU          RU-EN          EN-JA          JA-EN
10      20.8→20.9*     26.2→26.0      23.5→23.4      16.9→16.8
50      21.0→21.2*     25.9→25.7      24.0→23.7*     16.9→16.9
100     20.4→21.1      25.9→25.8      23.8→23.5      16.9→17.0
200     21.0→20.8      25.8→25.9      23.8→23.4      17.0→16.8
500     20.9→20.9      25.8→25.9*     24.0→23.8      16.8→17.1*
1000    20.9→21.1      25.9→26.0**    23.6→23.5      16.9→17.1

Table 4: BLEU scores (mkcls→BIRA) and significance across cluster sizes (|C|).

The BLEU score differences between using mkcls and our BIRA implementation are small, but there are a few statistically significant changes, using bootstrap resampling (Koehn, 2004). Table 4 presents the BLEU score changes across varying cluster sizes (*: p-value < 0.05, **: p-value < 0.01). MERT tuning is quite erratic, and some of the BLEU differences could be affected by noise in the tuning process in obtaining quality weight values. Using our BIRA implementation reduces the translation model training time with 500 clusters from 20 hours using mkcls (of which 60% of the time is spent on clustering) to just 8 hours (of which 5% is spent on clustering).
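The significance marks in Table 4 are based on bootstrap resampling over test sentences (Koehn, 2004). As a schematic illustration only, a paired bootstrap test in that spirit can be written as below; corpus_score stands for any corpus-level metric such as BLEU from an external toolkit, and all names are hypothetical:

import random

def paired_bootstrap(hyps_a, hyps_b, refs, corpus_score, samples=1000, seed=0):
    """Paired bootstrap resampling in the spirit of Koehn (2004).
    hyps_a and hyps_b are two systems' outputs on the same test set, refs the
    references, and corpus_score maps (hypotheses, references) to a corpus-level
    metric such as BLEU.  Returns the fraction of resampled test sets on which
    system B fails to beat system A, used as an approximate p-value for B > A."""
    rng = random.Random(seed)
    n = len(refs)
    failures = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentences with replacement
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        if corpus_score(sample_b, sample_r) <= corpus_score(sample_a, sample_r):
            failures += 1
    return failures / samples

Under such a scheme, an approximate p-value below 0.05 corresponds to the single-starred entries in Table 4 and below 0.01 to the double-starred ones.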
5 Conclusion

We have presented improvements to the predictive exchange algorithm that address longstanding drawbacks of the original algorithm compared to other clustering algorithms, enabling new directions in using large-scale, high cluster-size word classes in NLP.

Botros et al. (2015) found that the one-sided model of the predictive exchange algorithm produces better results for training LSTM-based language models compared to two-sided models, while two-sided models generally give better perplexity in class-based LM experiments. Our paper shows that BIRA-based predictive exchange clusters are competitive with two-sided clusters even in a two-sided evaluation. They also give better perplexity than the original predictive exchange algorithm and Brown clustering.

The software is freely available at https://github.com/jonsafari/clustercat .

Acknowledgements

We would like to thank Hermann Ney and Kazuki Irie, as well as the reviewers for their useful comments. This work was supported by the QT21 project (Horizon 2020 No. 645452).

References

Rami Botros, Kazuki Irie, Martin Sundermeyer, and Hermann Ney. 2015. On Efficient Training of Word Classes and their Application to Recurrent Neural Network Language Models. In Proceedings of INTERSPEECH-2015, pages 1443–1447, Dresden, Germany.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.
Marie Candito and Djamé Seddah. 2010. Parsing Word Clusters. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 76–84, Los Angeles, CA, USA.
Victor Chahuneau, Eva Schlinger, Noah A. Smith, and Chris Dyer. 2013. Translating into Morphologically Rich Languages with Synthetic Phrases. In Proceedings of EMNLP, pages 1677–1687, Seattle, WA, USA.
Colin Cherry. 2013. Improved Reordering for Phrase-Based Translation using Sparse Features. In Proceedings of NAACL-HLT, pages 22–31, Atlanta, GA, USA.
Alexander Clark. 2003. Combining Distributional and Morphological Information for Part of Speech Induction. In Proceedings of EACL, pages 59–66.
Leon Derczynski and Sean Chester. 2016. Generalised Brown Clustering and Roll-up Feature Generation. In Proceedings of AAAI, Phoenix, AZ, USA.
Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the Usefulness of Generalized Word Representations in SMT. In Proceedings of Coling, pages 421–432, Dublin, Ireland.


Joshua Goodman. 2001a. Classes for Fast Maximum Entropy Training. In Proceedings of ICASSP, pages 561–564.
Joshua T. Goodman. 2001b. A Bit of Progress in Language Modeling, Extended Version. Technical Report MSR-TR-2001-72, Microsoft Research.
Spence Green, Daniel Cer, and Christopher Manning. 2014. An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation. In Proceedings of WMT, pages 466–476, Baltimore, MD, USA.
Reinhard Kneser and Hermann Ney. 1993. Improved clustering techniques for class-based statistical language modelling. In Proceedings of EUROSPEECH'93, pages 973–976, Berlin, Germany.
Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of EMNLP-CoNLL, pages 868–876, Prague, Czech Republic.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, pages 388–395.
Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A Dependency Parser for Tweets. In Proceedings of EMNLP, pages 1001–1012, Doha, Qatar.
Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple Semi-supervised Dependency Parsing. In Proceedings of ACL: HLT, pages 595–603, Columbus, OH, USA.
Percy Liang. 2005. Semi-Supervised Learning for Natural Language. Master's thesis, MIT.
Sven Martin, Jörg Liermann, and Hermann Ney. 1998. Algorithms for Bigram and Trigram Word Clustering. Speech Communication, 24(1):19–37.
Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Workshop Proceedings of the International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA.
Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name Tagging with Word Clusters and Discriminative Training. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, Proceedings of HLT-NAACL, pages 337–342, Boston, MA, USA.
Andriy Mnih and Geoffrey Hinton. 2009. A Scalable Hierarchical Distributed Language Model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in NIPS-21, volume 21, pages 1081–1088.
Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In Proceedings of AISTATS, volume 5, pages 246–252.
Thomas Müller and Hinrich Schütze. 2015. Robust Morphological Tagging with Word Representations. In Proceedings of NAACL, pages 526–536, Denver, CO, USA.
Toshiaki Nakazawa, Hideya Mino, Isao Goto, Sadao Kurohashi, and Eiichiro Sumita. 2014. Overview of the first Workshop on Asian Translation. In Proceedings of the Workshop on Asian Translation (WAT).
Franz Josef Och and Hermann Ney. 2000. A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of Coling, pages 1086–1090, Saarbrücken, Germany.
Franz Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung. Bachelor's thesis (Studienarbeit), Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.
Slav Petrov. 2009. Coarse-to-Fine Natural Language Processing. Ph.D. thesis, University of California at Berkeley, Berkeley, CA, USA.
Lev Ratinov and Dan Roth. 2009. Design Challenges and Misconceptions in Named Entity Recognition. In Proceedings of CoNLL, pages 147–155, Boulder, CO, USA.
Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of EMNLP 2011, pages 1524–1534, Edinburgh, Scotland.
Attapol Rutherford and Nianwen Xue. 2014. Discovering Implicit Discourse Relations Through Brown Cluster Pair Representation and Coreference Patterns. In Proceedings of EACL, pages 645–654, Gothenburg, Sweden.
Sara Stymne. 2012. Clustered Word Classes for Preordering in Statistical Machine Translation. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, pages 28–34, Avignon, France.
Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure. In Proceedings of NAACL:HLT, pages 477–487, Montréal, Canada.
Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of ACL, pages 384–394, Uppsala, Sweden.
Jakob Uszkoreit and Thorsten Brants. 2008. Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation. In Proceedings of ACL: HLT, pages 755–762, Columbus, OH, USA.
Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving Statistical Machine Translation with Word Class Models. In Proceedings of EMNLP, pages 1377–1381, Seattle, WA, USA.
Andreas Zollmann and Stephan Vogel. 2011. A Word-Class Approach to Labeling PSCFG Rules for Machine Translation. In Proceedings of ACL-HLT, pages 1–11, Portland, OR, USA.
