
AN ABSTRACT OF THE DISSERTATION OF

Yilin Yang for the degree of Doctor of Philosophy in Computer Science presented on

June 1, 2022.

Title: Explaining and Improving Neural Machine Translation

Abstract approved:
Prasad Tadepalli & Stefan Lee

Machine Translation, the task of automatically translating between human languages, has been studied for decades. It was traditionally solved by count-based statistical models, e.g. Phrase-Based Statistical Machine Translation (PBSMT) [1], which addresses the translation problem by separately training a statistical language model and a translation model. More recently, Neural Machine Translation (NMT) was proposed to solve this problem by training a Deep Neural Network (DNN) [2] in an end-to-end fashion; it demonstrates superior performance and has thus become the state-of-the-art approach to Machine Translation.
However, black-box DNN models are inscrutable, and it is difficult to explain the common
failure modes of NMT systems. In this thesis, we attempt to explain and improve the modern
Neural Machine Translation model by looking closely into multiple aspects of it and proposing
new algorithms and heuristics that improve translation performance.
At inference time, it is common to adopt the beam search algorithm to explore the
exponential candidate space. Yet, NMT models are widely found to generate worse translations
as the search budget increases (i.e. with larger beam sizes). This phenomenon, commonly referred
to as the beam search curse [3], hinders translation performance and limits practical beam sizes
to less than 10. In Chapter 3, we examine the beam search curse in depth and propose new
rescoring methods to ameliorate it.
With their deep stacks of architecturally identical layers, DNN-based NMT models are nearly
impossible to explain in terms of internal functionality. In Chapter 4, we explain the
module-level functionalities of the NMT decoder through our proposed information probing
framework. Somewhat surprisingly, we find that half of its parameters can be dropped with minimal
performance loss [4].
Finally, for the recently popularized Multilingual NMT model [5], a single model is trained on
all language pairs to translate between all languages. We find that it faces a severe off-target
translation issue, where it frequently produces outputs in the incorrect language. In Chapter 5, we
propose representation-level and gradient-level regularizations during training that significantly
improve translation performance and reduce off-target errors. In Chapter 6, we explain how
off-target translation emerges during beam search decoding, and propose a language-informed
beam search algorithm that notably reduces off-target errors purely at decoding time.
© Copyright by Yilin Yang
June 1, 2022
All Rights Reserved
Explaining and Improving Neural Machine Translation

by

Yilin Yang

A DISSERTATION

submitted to

Oregon State University

in partial fulfillment of
the requirements for the
degree of

Doctor of Philosophy

Presented June 1, 2022


Commencement June 2023
Doctor of Philosophy dissertation of Yilin Yang presented on June 1, 2022.

APPROVED:

Major Professor, representing Computer Science

Head of the School of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my dissertation will become part of the permanent collection of Oregon State
University libraries. My signature below authorizes release of my dissertation to any reader
upon request.

Yilin Yang, Author


ACKNOWLEDGEMENTS

I am deeply grateful to have completed my Ph.D. degree in the company of my advisors, friends,
and my family in China.
First and foremost, I want to thank all the support I received from my major advisors Prasad
Tadepalli and Stefan Lee. I especially want to thank Prasad for generously accepting me, which
allowed me to continue my Ph.D. study, and for his generous support in letting me intern at
various industrial research labs, where I was able to meet new people and encounter new ideas
that helped build up my thesis. I am very grateful for Stefan's later engagement as my co-advisor;
with his extensive knowledge and help I was able to achieve much more. I also thank my
first advisor Liang Huang, who fortunately brought me into the research field of Neural Machine
Translation.
I also want to thank my colleagues Longyue Wang and Zhaopeng Tu from Tencent AI Lab,
Akiko Eriguchi, Alexandre Muzio and Hany Hassan from Microsoft Research, Changhan Wang,
Yun Tang, Ann Lee and Wei-Ning Hsu from Meta AI Research.
I want to thank my friends for accompanying me throughout my Ph.D., including but not
limited to Jinta Zheng, Qi Lyu, Jun Li, Anurag Koul, Matthew Olson, Juneki Hong, Neale Ratzlaff,
Yiren Wang, Jie Hao, Xinwei Geng, Bo He, Shilin He, Yong Wang, Shuo Wang, Danni Chen,
Liuting Chen, Yiying Pu, Shuning Chen, Wei-Peng Lew, Ruoyan Chen, and Xinyao Gao.
Last but not least, I thank my parents back in China for emotionally and financially supporting me
since my birth. They paved the road for me to grow and achieve, and they strongly motivated me
to start my Ph.D., since they both hold PhDs and are currently professors at Hebei University of
Engineering.
TABLE OF CONTENTS
Page

1 Introduction 1

2 Background 3
2.0.1 NMT Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.0.2 NMT Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.0.3 NMT Inference & Beam Search Decoding . . . . . . . . . . . . . . . 3
2.0.4 Multilingual NMT Training . . . . . . . . . . . . . . . . . . . . . . . 5

3 Explaining and Improving the Beam Search Curse 7


3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Beam Search Curse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Rescoring Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3.1 Previous Rescoring Methods . . . . . . . . . . . . . . . . . . . . . . . 10
3.3.2 Rescoring with Length Prediction . . . . . . . . . . . . . . . . . . . . 11
3.4 Stopping Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4.1 Conventional Stopping Criteria . . . . . . . . . . . . . . . . . . . . . 13
3.4.2 Optimal Stopping Criteria . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5.1 Parameter Tuning and Results . . . . . . . . . . . . . . . . . . . . . . 15
3.5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 Explaining and Improving the Transformer Decoder 19


4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Transformer Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Sub-Layer Partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5.1 Representation Evolution Across Layers . . . . . . . . . . . . . . . . 24
4.5.2 Exploitation of Source Information . . . . . . . . . . . . . . . . . . . 27
4.5.3 Information Fusion in Decoder . . . . . . . . . . . . . . . . . . . . . 29
4.6 Transformer Big Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Explaining and Improving Multilingual Off-Target Translation 40


5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.1 Representation-Level Regularization: Target Language Prediction (TLP) 41
5.2.2 Gradient-Level Regularization: Target-Gradient-Projection (TGP) . . . 42
5.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.1 Datasets: WMT-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.2 Datasets: OPUS-100 . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3.3 Training and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.4 Construction of Oracle Data . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.1 TLP Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4.2 TGP Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.3 TGP in a Zero-Shot Setting . . . . . . . . . . . . . . . . . . . . . . . 50
5.4.4 Joint TLP+TGP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.5 Discussions on Off-Target Translations . . . . . . . . . . . . . . . . . . . . . . . 53
5.5.1 Off-Targets on English-Centric Pairs . . . . . . . . . . . . . . . . . . 53
5.5.2 Token-Level Off-Targets . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.6 TGP Explanation and Visualization . . . . . . . . . . . . . . . . . . . . . . . . 56
5.7 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.7.1 WMT-10 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.7.2 OPUS-100 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.7.3 Statistics of the Oracle Data . . . . . . . . . . . . . . . . . . . . . . . 60

6 Explaining and Improving Multilingual Beam Search Decoding 62


6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2.2 Model Training and Evaluation . . . . . . . . . . . . . . . . . . . . . 63
6.3 Analyzing Off-Target Occurrence During Beam Search . . . . . . . . . . . . . . 64
6.3.1 Multilingual Beam Search Curse . . . . . . . . . . . . . . . . . . . . 64
6.3.2 Off-Target Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3.3 Beam Search Process Analysis . . . . . . . . . . . . . . . . . . . . . 67
6.4 Language-Informed Beam Search (LIBS) . . . . . . . . . . . . . . . . . . . . . 69
6.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


6.5.1 WMT Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.5.2 OPUS-100 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

7 Summary & Future Work 77


LIST OF FIGURES
Figure Page

2.1 The Transformer model architecture (image from [7]). . . . . . . . . . . . . . . 4

3.1 As beam size increases beyond 3, BLEU score on the dev set gradually drops.
All terms are calculated by multi-bleu.pl. . . . . . . . . . . . . . . . . . . . . . 8

3.2 Searching algorithm with larger beams generates </s> earlier. We use the average
first, second, and third </s> positions on the dev set as an example. . . . . . . . 9

3.3 Candidate lengths vs. model score. This scatter plot is generated from 242 fin-
ished candidates when translated from one source sequence with beam size 80. 10

3.4 The BLEU scores and length ratios (lr = |y|/|y∗ |) of various rescoring methods.
We can observe our proposed methods successfully tune the length ratio to nearly
1. Therefore, they do not suffer from the brevity penalty of BLEU. . . . . . . . 17

3.5 BLEU scores and length ratios on the dev set over various input sentence lengths. 18

4.1 A sub-layer splitting of Transformer decoder with respect to their functionalities. 22

4.2 Illustration of the information probing model, which reads the representation of
a decoder module (“Input 1”) and the word sequence to recover (“Input 2”), and
outputs the generation probability (“Output”). . . . . . . . . . . . . . . . . . . 25

4.3 Evolution trends of source (upper panel) and target (bottom panel) information
embedded in the decoder modular representations across layers. Lower perplexity
(“PPL”) denotes more information embedded in the representations. . . . . . . 34

4.4 Behavior of the SEM in terms of (a) alignment quality measured in AER (the
lower, the better), and (b) the cumulative coverage of source words. . . . . . . 35

4.5 Effects of the stacking order of TEM and SEM on the En-De dataset. . . . . . . 35

4.6 Effects of the stacking order of TEM and SEM on the En-Zh dataset. . . . . . . 36

4.7 Effects of decoder depths on SEM behaviors on the En-De task. . . . . . . . . 36

4.8 Illustration of (a) three operations within IFM, and (b,c) the source and target
information evolution within IFM on En-De task. . . . . . . . . . . . . . . . . 37

4.9 Illustration of (a) the standard decoder and (b) the simplified decoder. . . . . . 38
4.10 Comparison of IFM information evolution between the standard and simplified
decoder on En-De. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.1 The training loss curve of TGP on WMT-10. . . . . . . . . . . . . . . . . . . . 51

5.2 The test BLEU curves of TGP on WMT-10. . . . . . . . . . . . . . . . . . . . 52

5.3 Relationship between the random token-level probability p and the reported
sentence-level off-target rates. Token-level off-targets are introduced by replacing
the in-target token with a random off-target token with a probability p. Analysis
done on the WMT-10 English-free references. . . . . . . . . . . . . . . . . . . 54

5.4 TGP projection visualization on a 2-D case, imagining a third axis pointing
outwards representing the train/dev loss curves. Any point on the plot represents
a model state, where the blue arrow represents the training gradient and the red
arrow represents the dev gradient. . . . . . . . . . . . . . . . . . . . . . . . . 57

5.5 TGP gradient visualization on a 2-D case. The red and blue contours represent
train and dev loss level sets. Arrows represent the TGP gradients, where green
arrows represent unprojected gradients, and grey arrows represent projected ones. 59

6.1 The sentence BLEU distribution between source and system translation from
WMT Fr→De “→Source” errors. . . . . . . . . . . . . . . . . . . . . . . . . . 66

6.2 The sentence BLEU distribution between WMT Fr→De “→English” translation
and Fr→En translation with the same source input. . . . . . . . . . . . . . . . 68

6.3 Translation performance (BLEU and off-target rates) with different α values on
OPUS-100 Fr→De test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Chapter 1: Introduction

The field of Machine Translation, automatically translating written text from one natural language into another, has experienced a major revolution with the development of Neural Machine Translation (NMT) techniques [2].
Given their empirical and commercial success, neural network-based machine translation
approaches have become the de-facto standard in both research and industry – displacing prior
linguistically-informed statistical methods. These neural machine translation techniques learn to
automatically translate discretized source sequences into the target language from large parallel
corpora.
Modern neural machine translation approaches tend to be massive models composed of
millions of weights arranged in complex computational structures and are learned from large-
scale data sources. As a result, it is difficult to inspect what these models have learned, diagnose
the causes of deficiencies, or even divide models into conceptually meaningful subcomponents.
At inference time, researchers often adopt the beam search algorithm to explore the
exponential search space. Meanwhile, it is widely observed that NMT models suffer from
larger beam sizes, a phenomenon often referred to as the "beam search curse". In Chapter 3, we
perform an in-depth study of the beam search curse in NMT models and identify
its correlation with an unsatisfactory length bias (i.e. the NMT model tends to generate shorter
sentences with larger beam sizes) [3]. Beyond explaining the beam search curse, we propose
several easy-to-use rescoring methods to reduce the negative biases introduced by beam search.
With experiments on the Chinese→English translation task, we empirically show that our proposed
rescoring methods break the beam search curse, so that NMT performance steadily
improves with larger beam sizes. Our proposed method improves BLEU by +3.1
and +3.7 over the baseline with small and large beam sizes respectively. In addition, we show
provably-optimal stopping criteria for all of our proposed rescoring methods.
Although NMT models are trained in an end-to-end fashion, they are typically designed as a
cascaded encoder-decoder architecture. Since the NMT encoder only takes the source sentence as
input, it functions purely as a source-information feature extractor. Meanwhile, since the NMT
decoder takes both source information (from the encoder output) and target information (from the
generated target prefix) as input, it is hypothesized to serve both as a translation model and as a
(target) language model. By comparison, the PBSMT model trains the two models separately
as statistical phrase-based tables and cascades them during inference.
To better understand how the two models virtually co-exist in the NMT decoder, in Chapter 4
we propose a novel information probing framework to quantify the amount of source and target
information within the internal representations, and how it changes across stages (or modules)
in the network architecture [4]. Our language-based probing experiments suggest some parts
of modern NMT architectures serve limited roles in the computation. By removing these ele-
ments, we are able to halve the number of parameters while maintaining comparable performance.
Smaller models tend to require less time and energy to execute and are thus cheaper to deploy
both at scale and on personal devices.
Recently, [5] proposed and popularized Multilingual NMT (MNMT) models, which are trained on
all language pairs to translate between all languages. Due to its deployment efficiency, MNMT
soon became the industrial standard for providing translation services. Despite its appeal, the
state-of-the-art MNMT model is often found to translate into the wrong language, a phenomenon
referred to as "off-target translation". To explain off-target translation, in Chapters 5 and 6
we comprehensively categorize the off-target error types and investigate, with case studies, how
off-target translations emerge during beam search decoding. Further, to reduce off-target errors,
we propose a joint approach to regularize NMT models at both the representation level and the
gradient level during training. At the representation level, we leverage an auxiliary target language
prediction task to regularize decoder outputs to retain information about the target language. At
the gradient level, we leverage a small amount of direct data (thousands of sentence pairs) to
regularize model gradients. Our results demonstrate that our approach is highly effective in both
reducing off-target translation occurrences and improving zero-shot translation performance, by
+5.59 and +10.38 BLEU on the WMT and OPUS datasets respectively. Beyond training-time
regularization, we also propose the language-informed beam search (LIBS) algorithm to
significantly reduce off-target errors purely at decoding time. Importantly, this means the
proposed technique can be added post hoc to reduce off-target translation for any existing
multilingual model. Results on two large-scale datasets (WMT and OPUS-100) demonstrate that
LIBS significantly reduces the off-target rate and improves translation performance by +1 BLEU.

Chapter 2: Background

2.0.1 NMT Training


[2] proposes to borrow an encoder-decoder architecture with an attention mechanism to solve the
MT task, which could be implemented with an RNN [2], CNN [6], or Transformer [7] model. An
NMT encoder inputs a source sequence x = (x1 , ..., x|x| ), and produces a sequence of hidden
states. For each time step, the NMT decoder will predict the probability of the next output word
given the encoder states and the previously generated prefix. Given the source sentence x and
the parallel target sentence y = (y1 , y2 , ..., y|y| ), the NMT model is trained with the following
cross-entropy loss (i.e. negative log-likelihood):

L_{NMT} = -\sum_{t=1}^{|y|} \log P_\theta(y_t \mid x, y_{1..(t-1)})    (2.1)

where Pθ denotes the NMT model probabilities.
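As a concrete illustration (not the thesis implementation), the following PyTorch-style sketch computes the loss of Equation 2.1 from decoder logits; the tensor shapes and the padding id are assumptions made for the example.

```python
import torch.nn.functional as F

def nmt_loss(logits, target, pad_id):
    """Cross-entropy loss of Equation 2.1, summed over the batch.

    logits: (batch, tgt_len, vocab) decoder scores, where position t is
            conditioned on the source x and the target prefix y_{1..t-1}.
    target: (batch, tgt_len) gold target token ids y_1 .. y_|y|.
    """
    log_probs = F.log_softmax(logits, dim=-1)                   # log P_theta(. | x, y_<t)
    gold = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    mask = target.ne(pad_id).float()                            # ignore padding positions
    return -(gold * mask).sum()                                 # negative log-likelihood
```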

2.0.2 NMT Model


The state-of-the-art NMT model usually has a cascaded encoder-decoder architecture, where
the encoder only takes as input the source sequence, and the decoder inputs both encoder states
and the generation prefix. Since [7], the Transformer model has become the state-of-the-art
architecture for NMT, as well as for many other tasks. The Transformer model (shown in Figure
2.1) has identically stacked encoder and decoder layers, where one encoder layer includes Self-
Attention, Feed-Forward, and LayerNorm operations, and one decoder layer additionally includes
a Cross-Attention operation.

2.0.3 NMT Inference & Beam Search Decoding


During inference, the trained NMT model would generate the target sequence starting from
the begin-of-sentence symbol (<s>) till the end-of-sentence symbol (</s>). Since at each target
Figure 2.1: The Transformer model architecture (image from [7]).

length there are an exponential number of candidates, common approaches would implement
beam search or greedy search (i.e. beam size equals one) over the search space. When doing the
greedy search, at time step i, the decoder will choose the word with the highest probability as yi .

yi ← argmax p(wordn |x1..m , y1..i−1 ) (2.2)

The decoder will continue generating words until it emits </s>. In the end, the generated hypothesis is y = (y_1, ..., y_{|y|}) where y_{|y|} = </s>, with a model score

S(x, y) = \sum_{i=1}^{|y|} \log p(y_i \mid x, y_{1..(i-1)})    (2.3)

As greedy search only explores a single path in an exponentially large search space, researchers typically adopt beam search to improve the search quality. Beam search is an approximate fixed-width breadth-first search that extends a set of b candidate partial decodings (i.e. beams) at each time step and then keeps only the top b. Formally, at step t with beam size b, the active candidate set B_t is an ordered list of size b:

B0 ← {⟨0.0, <s>⟩} (2.4)


Bt ← topb {⟨s · pθ (yt | x, y), y ◦ yt ⟩ | ⟨s, y⟩ ∈ Bt−1 , yt ∈ V} (2.5)

In the most naive case, after reaching the maximum length (a hard limit), we get N possible
candidate sequences {y1 , ..., yN }. The default strategy chooses the one with the highest model
score.
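The minimal Python sketch below (ours, not the thesis implementation) makes the procedure of Equations 2.4-2.5 concrete. Here `step_log_probs` is a hypothetical callback standing in for the NMT decoder conditioned on the source sentence, scores are accumulated in log space as in Equation 2.3, and greedy search is recovered with beam_size = 1.

```python
def beam_search(step_log_probs, bos, eos, beam_size, max_len):
    """Fixed-width breadth-first search of Equations 2.4-2.5."""
    beam = [(0.0, [bos])]                        # B_0 = {<0.0, <s>>}
    finished = []
    for _ in range(max_len):                     # hard length limit
        candidates = []
        for score, prefix in beam:
            if prefix[-1] == eos:                # set finished hypotheses aside
                finished.append((score, prefix))
                continue
            for tok, logp in step_log_probs(prefix).items():
                candidates.append((score + logp, prefix + [tok]))
        if not candidates:                       # every hypothesis has finished
            break
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    finished.extend(beam)                        # fall back to unfinished ones at max_len
    return max(finished, key=lambda c: c[0])     # default: highest model score
```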

2.0.4 Multilingual NMT Training


As recently proposed by [5], a single NMT model can serve all translation directions when trained
on all bilingual parallel corpora. Before concatenating all bilingual data, an artificial token is
appended to each source sentence to denote its target language.
Specifically, given a source sentence x^i = (x^i_1, x^i_2, ..., x^i_{|x^i|}) in language i and the parallel target sentence y^j = (y^j_1, y^j_2, ..., y^j_{|y^j|}) in language j, the multilingual model is trained with the following cross-entropy loss:

L_{NMT} = -\sum_{t=1}^{|y^j|} \log P_\theta(y^j_t \mid x^i, \langle j \rangle, y^j_{1..(t-1)})    (2.6)

where ⟨j⟩ is the artificial token specifying the desired target language, and Pθ is parameterized
exactly as the bilingual NMT model.
The state-of-the-art multilingual system is trained on the concatenated parallel corpus of all
available language pairs in both forward and backward directions, which is also referred to as a
many-to-many multilingual system. To balance the training batches between high-resource and
low-resource language pairs, researchers often adopt a temperature-based sampling to up/down-
sample bilingual data accordingly [8].
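As an illustration of the temperature-based sampling mentioned above, the sketch below computes per-pair sampling weights p_l ∝ (n_l / N)^{1/T}, the formulation commonly used for this strategy [8]; the function name and the example corpus sizes are ours.

```python
def temperature_sampling_weights(pair_sizes, temperature=5.0):
    """Per-pair sampling probabilities p_l proportional to (n_l / N) ** (1 / T).

    T = 1 reproduces proportional sampling; larger T flattens the distribution
    toward uniform, up-sampling low-resource pairs."""
    total = sum(pair_sizes.values())
    scaled = {pair: (n / total) ** (1.0 / temperature) for pair, n in pair_sizes.items()}
    norm = sum(scaled.values())
    return {pair: w / norm for pair, w in scaled.items()}

# e.g. temperature_sampling_weights({"en-de": 4_500_000, "en-gu": 10_000}, temperature=5)
```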

Chapter 3: Explaining and Improving the Beam Search Curse

3.1 Motivation
In recent years, neural machine translation (NMT) has surpassed traditional phrase-based or
syntax-based machine translation, becoming the state of the art in MT [2, 9]. While NMT
training is typically done in a “local” fashion which does not employ any search (bar notable
exceptions such as [10], [11], and [12]), the decoding phase of all NMT systems universally
adopts beam search, a widely used heuristic search algorithm, to improve the translation quality.
Unlike phrase-based MT systems which enjoy the benefits of very large beam sizes (in the
order of 100–500) [13], most NMT systems choose tiny beam sizes up to 5; for example, Google’s
GNMT [14] and Facebook’s ConvS2S [6] use beam sizes 3 and 5, respectively. Intuitively, the
larger the beam size, the more candidates it explores, and the better the translation quality should
be. While this definitely holds for phrase-based MT systems, surprisingly, it is not the case for
NMT: many researchers observe that translation quality degrades with beam sizes beyond 5 or
10 [15]. We call this phenomenon the “beam search curse”, which is listed as one of the six
biggest challenges for NMT [15].
However, this problem has not received enough attention. [16] hints that the length ratio is
the problem, but does not explain why larger beam sizes cause shorter outputs and worse
BLEU scores. [17] attributes it to two kinds of "uncertainties" in the training data, namely the
copying of the source sentence and non-literal translations. However, the first problem is only
found in European-language datasets, and the second occurs in all datasets but does not
seem to affect pre-neural MT systems. Therefore, their explanations are not satisfactory.
On the other hand, previous work adopts several heuristics to address this problem, but
with various limitations. For example, RNNSearch [2] uses length normalization, which (as we
will show in Sec. 3.5) seems to somewhat alleviate the problem, while being far from perfect.
Meanwhile, [18] and [16] use word-reward, but their reward is a hyper-parameter to be tuned on
the dev set.
Our contributions are as follows:

• We explain why the beam search curse exists, supported by empirical evidence (Sec. 3.2).
Figure 3.1: As beam size increases beyond 3, BLEU score on the dev set gradually drops. All terms are calculated by multi-bleu.pl.

• We review existing rescoring methods, and then propose ours to break the beam search
curse (Sec. 3.3). We show that our hyperparameter-free methods outperform the previous
hyperparameter-free method (length normalization) by +2.0 BLEU (Sec. 3.5).

• We also discuss the stopping criteria for our rescoring methods (Sec. 3.4). Experiments
show that with optimal stopping alone, the translation quality of the length normalization
method improves by +0.9 BLEU.

3.2 Beam Search Curse


The most popular translation quality metric, BLEU [19], is defined as:

BLEU = bp \cdot \exp\big(\tfrac{1}{4}\sum_{n=1}^{4} \log p_n\big)    (3.1)
where bp = \min\{e^{1-1/lr}, 1\}    (3.2)
and lr = |y| / |y^*|    (3.3)
Figure 3.2: The search algorithm with larger beams generates </s> earlier. We use the average first, second, and third </s> positions on the dev set as an example.

Here p_n are the n-gram precisions, |y| and |y^*| denote the hypothesis and reference lengths,
bp is the brevity penalty (penalizing short translations), and lr is the length ratio [15, 20].
With increasing beam size, |y| decreases, which causes the length ratio to drop, as shown in
Fig. 3.1. The brevity penalty term, as a function of the length ratio, then decreases even more
severely. Since bp is a crucial factor in BLEU, this explains why the beam search curse happens.¹
Indeed, [21] confirm the same phenomenon with METEOR.
The reason why |y| decreases as beam size increases is actually twofold:

1. As beam size increases, more candidates can be explored. Therefore, it becomes easier
for the search algorithm to find the </s> symbol. Fig. 3.2 shows that the </s> indices
decrease steadily with larger beams.²
¹ The length ratio is not just about BLEU: if the hypothesis length is only 75% of the reference length, something that should have been translated must be missing; i.e., bad adequacy.
² Pre-neural SMT models, being probabilistic, also favor short translations (and derivations), which is addressed by word (and phrase) reward. The crucial difference between SMT and NMT is that the former stops when covering the whole input, while the latter stops on emitting </s>.
Figure 3.3: Candidate lengths vs. model score. This scatter plot is generated from 242 finished candidates when translating one source sequence with beam size 80.

2. Then, as shown in Fig. 3.3, shorter candidates have clear advantages w.r.t. model score.

Hence, as beam size increases, the search algorithm will generate shorter candidates, and then
prefer even shorter ones among them.

3.3 Rescoring Methods


We first review existing methods to counter the length problem and then propose new ones to
address their limitations. In particular, we propose to predict the target length from the source
sentence, in order to choose a hypothesis with a proper length.

3.3.1 Previous Rescoring Methods


RNNSearch [2] first introduces the length normalization method, whose score is simply the
average model score per output word:

Ŝlength_norm (x, y) = S(x, y)/|y| (3.4)



This is the most widely used rescoring method since it is hyperparameter-free.


GNMT [14] incorporates length and coverage penalties into the length normalization method,
adding two hyperparameters to adjust their influence (see their paper for the exact formulas).
Baidu NMT [18] borrows the Word Reward method from pre-neural MT, which gives a
reward r to every word generated, where r is a hyperparameter tuned on the dev set:

ŜWR (x, y) = S(x, y) + r · |y| (3.5)

Based on the above, [16] proposes a variant called Bounded Word-Reward which only rewards
up to an “optimal” length. This length is calculated using a fixed “generation ratio” gr, which
is the ratio between target and source sequence length, namely the average number of target
words generated for each source word. It gives reward r to each word up to a bounded length
L(x, y) = min{|y|, gr · |x|}:

ŜBWR (x, y) = S(x, y) + r · L(x, y) (3.6)

3.3.2 Rescoring with Length Prediction


To remove the fixed generation ratio gr from Bounded Word-Reward, we use a 2-layer MLP,
which takes the mean of source hidden states as input, to predict the generation ratio gr ∗ (x).
Then we replace the fixed ratio gr with it, and get our predicted length Lpred (x) = gr ∗ (x) · |x|.
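A minimal PyTorch sketch of such a generation-ratio predictor is given below; the module name, the ReLU activation, and the mean-pooling over (unmasked) encoder states are illustrative assumptions, not the exact architecture used in the experiments.

```python
import torch.nn as nn

class GenerationRatioPredictor(nn.Module):
    """2-layer MLP that predicts the generation ratio gr*(x) from the
    mean of the source hidden states, giving L_pred(x) = gr*(x) * |x|."""

    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, encoder_states, src_lengths):
        # encoder_states: (batch, src_len, hidden); padding ignored for brevity
        mean_state = encoder_states.mean(dim=1)
        gr = self.mlp(mean_state).squeeze(-1)     # predicted ratio gr*(x)
        return gr * src_lengths.float()           # predicted length L_pred(x)
```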

3.3.2.1 Bounded Word-Reward


With predicted length, the new predicted bound and final score would be:

L∗ (x, y) = min{|y|, Lpred (x)} (3.7)


ŜBWR∗ (x, y) = S(x, y) + r · L∗ (x, y) (3.8)

While the predicted length is more accurate, there is still a hyperparameter r (word reward), so
we design two methods below to remove it.

3.3.2.2 Bounded Adaptive-Reward


We propose Bounded Adaptive-Reward to automatically calculate proper rewards based on the
current beam. With beam size b, the reward for time step t is the average negative log-probability
of the words in the current beam.

r_t = -\frac{1}{b}\sum_{i=1}^{b} \log p(word_i)    (3.9)

Its score is very similar to (Equation 3.6):

\hat{S}_{AdaR}(x, y) = S(x, y) + \sum_{t=1}^{L^*(x,y)} r_t    (3.10)

3.3.2.3 BP-Norm
Inspired by the BLEU score definition, we propose BP-Norm method as follows:

Ŝbp (x, y) = log bp + S(x, y)/|y| (3.11)

where bp is the same brevity penalty term as in (Equation 3.2). Here, we regard our predicted
length as the reference length. The beauty of this method appears when we exponentiate both
sides of Equation 3.11:

exp(\hat{S}_{bp}(x, y)) = bp \cdot \exp\big(\tfrac{1}{|y|}\sum_{i=1}^{|y|} \log p(y_i \mid ...)\big)
                        = bp \cdot \big(\prod_{i=1}^{|y|} p(y_i \mid ...)\big)^{1/|y|}

which is in the same form as the BLEU score (Equation 3.1).
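To make the three scores concrete, here is a small Python sketch (ours, not the original implementation) that rescores one finished hypothesis given its per-step log-probabilities, the per-step beam log-probabilities, and the predicted length L_pred.

```python
def bounded_word_reward(logps, l_pred, r):
    """Eq. 3.7-3.8: S(x,y) + r * min(|y|, L_pred)."""
    return sum(logps) + r * min(len(logps), l_pred)

def bounded_adaptive_reward(logps, beam_logps_per_step, l_pred):
    """Eq. 3.9-3.10: the reward at step t is the average negative
    log-probability of the words in the beam at that step."""
    rewards = [-sum(step) / len(step) for step in beam_logps_per_step]
    bound = min(len(logps), int(round(l_pred)))
    return sum(logps) + sum(rewards[:bound])

def bp_norm(logps, l_pred):
    """Eq. 3.11: log bp + S(x,y)/|y|, with L_pred playing the role of the
    reference length in the brevity penalty."""
    lr = len(logps) / l_pred
    log_bp = min(1.0 - 1.0 / lr, 0.0)
    return log_bp + sum(logps) / len(logps)
```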

3.4 Stopping Criteria


Besides rescoring methods, the stopping criterion (when to stop beam search) is also important,
for both efficiency and accuracy.

3.4.1 Conventional Stopping Criteria


By default, OpenNMT-py [22] stops when the topmost beam candidate finishes, because no
future candidate can have a higher model score. However, this is not the case for other
rescoring methods; e.g., the length normalization score (Equation 3.4) could still increase.
Another popular stopping criterion, used by RNNSearch [2], stops the beam search when
exactly b finished candidates have been found. Neither method is optimal.

3.4.2 Optimal Stopping Criteria


For Bounded Word-Reward, [16] introduces a provably-optimal stopping criterion that could stop
both early and optimally. We also introduce an optimal stopping criterion for BP-Norm. Each
time we generate a finished candidate, we update our best score Ŝ ⋆ . Then, for the topmost beam
candidate of time step t, we have:

\hat{S}_{bp} = \frac{S_{t,0}}{t} + \min\Big\{1 - \frac{L_{pred}}{t}, 0\Big\} \le \frac{S_{t,0}}{R}    (3.12)

where R is the maximum generation length. Since S_{t,0} will drop after time step t, once S_{t,0}/R ≤ Ŝ⋆, we reach optimality. This stopping criterion could also be applied to length normalization (Equation 3.4).
Meanwhile, for Bounded Adaptive-Reward, we can have a similar optimal stopping criterion:
If the score of the topmost beam candidate at time step t > L_pred is lower than Ŝ⋆, we reach
optimality.

Proof. The first part of ŜAdaR in (Equation 3.10) will decrease after time step t, while the second
part stays the same when t > Lpred . So the score in the future will monotonically decrease.
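The BP-Norm stopping test of Equation 3.12 reduces to a one-line check; the sketch below is a simplified illustration assuming model scores are non-positive log-probabilities.

```python
def bp_norm_should_stop(best_finished_score, top_beam_model_score, max_len):
    """Optimal stopping test of Eq. 3.12: S_{t,0}/R upper-bounds every future
    BP-Norm score (model scores are non-positive and only decrease), so once
    it falls below the best finished score we can safely stop."""
    return top_beam_model_score / max_len <= best_finished_score
```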

3.5 Experiments
Our experiments are on the Chinese-to-English translation task, based on the OpenNMT-py codebase.³ We train our model on 2M sentences and apply BPE [23] on both sides, which reduces the Chinese and English vocabulary sizes to 18k and 10k respectively. We then exclude pairs with more than 50 source or target tokens. We validate on NIST 06 and test on NIST 08 (newswire portions only for both). We report case-insensitive, 4-reference BLEU scores.

³ https://github.com/OpenNMT/OpenNMT-py

Small beam (b = 14, 15, 16)         dev              test
                                 BLEU   ratio     BLEU   ratio
Moses (b=70)                     30.14    -       29.41    -
Default (b=5)                    36.45   0.87     32.88   0.87
Length Norm.                     37.73   0.89     34.07   0.89
  + optimal stopping∗            38.69   0.92     35.00   0.92
[14] α=β=0.3                     38.12   0.89     34.26   0.89
Bounded word-r. r=1.3            39.22   0.98     35.76   0.98
with predicted length:
  Bounded word-r. r=1.4∗         39.53   0.97     35.81   0.97
  Bounded adaptive-reward∗       39.44   0.98     35.75   0.98
  BP-Norm∗                       39.35   0.98     35.84   0.99

Table 3.1: Average BLEU scores and length ratios over small beams (b = 14, 15, 16). ∗ denotes our methods.
We use 2-layer bidirectional LSTMs for the encoder. We train the model for 15 epochs
and choose the one with the lowest perplexity on the dev set. The batch size is 64; both word
embedding and hidden state sizes are 500, and the dropout rate is 0.3. The total parameter size is
28.5M.
While conducting our experiments, we found and fixed a decoding bug in OpenNMT-py,
which kept </s> in the beam and decoded it for one more step. We also eliminated repeating
phrases at the end of generation by forcing the decoder not to choose other words whenever </s>
has the highest probability. These changes improve the BLEU scores by around +0.3 in all cases.
To compare our baseline model with other results: the default OpenNMT-py model gets
29.08 BLEU with beam size 4, and 32.9 BLEU with beam size 4, while on the same dataset
Moses [13] gets 30.1 BLEU. When trained on 2.57M sentence pairs, [11] reports that Moses
gets 32.7 BLEU and RNNSearch gets 30.7 BLEU. Therefore, the default OpenNMT-py model
is actually a very competitive baseline.
Large beam (b = 39, 40, 41)         dev              test
                                 BLEU   ratio     BLEU   ratio
Moses (b=70)                     30.14    -       29.41    -
Default (b=5)                    36.45   0.87     32.88   0.87
Length Norm.                     38.15   0.88     34.26   0.88
  + optimal stopping∗            39.07   0.91     35.14   0.91
[14] α=β=0.3                     38.40   0.89     34.41   0.88
Bounded word-r. r=1.3            39.60   0.98     35.98   0.98
with predicted length:
  Bounded word-r. r=1.4∗         40.11   0.98     36.13   0.97
  Bounded adaptive-reward∗       40.14   0.98     36.23   0.98
  BP-Norm∗                       39.97   0.99     36.22   0.99

Table 3.2: Average BLEU scores and length ratios over large beams (b = 39, 40, 41). ∗ denotes our methods.

3.5.1 Parameter Tuning and Results


We compare all rescoring methods mentioned above. For the length normalization method, we
also show its results with optimal stopping.
For the Bounded Word-Reward method with and without our predicted length, we choose the
best r on the dev set separately. The length normalization used by [14] has two hyper-parameters,
namely α for length penalty and β for coverage penalty. We jointly tune them on the dev set and
choose the best configuration (α=0.3, β=0.3).
Figure 3.4 shows our results on the dev set. Our proposed methods achieve the best
performance on the dev set and continue to improve as beam size increases. We also observe that
optimal stopping boosts the performance of the length normalization method by around +0.9
BLEU. In our experiments, we use our predicted length as the maximum generation length in
(Equation 3.12). We further observe from Figure 3.5 that our methods keep the length ratio close
to 1, and greatly improve the translation quality of longer input sentences, which are notoriously
hard for NMT [11].
Tables 3.1 and 3.2 collect our results on both the dev and test sets. Our methods improve
results with both small and large beam sizes, averaged over b = 14, 15, 16 and b = 39, 40, 41
respectively.

3.5.2 Discussion
From Tables 3.1 and 3.2, we observe that with our length prediction model, the bounded
word-reward method gains consistent improvement. On the other hand, results for the length
normalization method show that the optimal stopping technique alone gains a significant
improvement of around +0.9 BLEU. With both, our proposed methods beat all previous methods,
and improve over the hyperparameter-free baseline (i.e. length normalization) by +2.0 BLEU.
Among our proposed methods, Bounded word-reward keeps the reward r as a hyper-parameter,
while the other two methods get rid of it. Among them, we recommend the BP-Norm method,
because it is the simplest and yet works as well as the others.
Figure 3.4: The BLEU scores and length ratios (lr = |y|/|y∗|) of various rescoring methods. We can observe that our proposed methods successfully tune the length ratio to nearly 1. Therefore, they do not suffer from the brevity penalty of BLEU.
Figure 3.5: BLEU scores and length ratios on the dev set over various input sentence lengths.

Chapter 4: Explaining and Improving the Transformer Decoder

4.1 Motivation
Transformer models have advanced the state of the art on a variety of natural language processing
(NLP) tasks, including machine translation [7], natural language inference, semantic role
labeling [24], and language representation [25]. However, so far not much is known about the
internal properties and functionalities they learn to achieve their superior performance, which
poses significant challenges for human understanding of the models and potentially for designing
better architectures.
Recent efforts on interpreting Transformer models mainly focus on assessing the encoder
representations [26–28] or interpreting the multi-head self-attention [29–31]. At the same time,
there have been few attempts to interpret the decoder side, which we believe is also of great
interest and should be taken into account when explaining encoder-decoder networks. The
reasons are threefold: (a) the decoder takes both source and target as input, and implicitly
performs the functionalities of both alignment and language modeling, which are at the core of
machine translation; (b) the encoder and decoder are tightly coupled, in that the output of the
encoder is fed to the decoder and the training signals for the encoder are back-propagated from the
decoder; and (c) recent studies have shown that the boundary between the encoder and decoder is
blurry, since some of the encoder functionalities can be substituted by the decoder cross-attention
modules [32].
In this study, we interpret the Transformer decoder by investigating when and where the
decoder utilizes source or target information across its stacked modules and layers. Without loss
of generality, we focus on the representation evolution¹ within a Transformer decoder. To this
end, we introduce a novel sub-layer² split with respect to their functionalities: the Target Exploitation
Module (TEM) for exploiting the representation from the translation history, the Source Exploitation
Module (SEM) for exploiting the source-side representation, and the Information Fusion Module
(IFM) for combining the representations from the other two (Section 4.3).
¹ By "evolution", we denote the progressive trend from the first layer till the last.
² Throughout this paper, we use the terms "sub-layer" and "module" interchangeably.

Further, we design a universal probing scheme to quantify the amount of specific information
embedded in network representations. By probing both source and target information from
decoder sub-layers, and by analyzing the alignment error rate (AER) and source coverage rate,
we arrive at the following findings:

• SEM guides the representation evolution within the NMT decoder (Section 4.5.1).

• Higher-layer SEMs accomplish the functionality of word alignment, while lower-layer ones construct the necessary contexts (Section 4.5.2).

• TEMs are critical to helping SEM build word alignments, while their stacking order is not
essential (Section 4.5.2).

Last but not least, we conduct a fine-grained analysis of the information fusion process within
IFM. Our key contributions in this work are:

1. We introduce a novel sub-layer split of the Transformer decoder with respect to their functionalities.

2. We introduce a universal probing scheme from which we derive the aforementioned conclusions
about the Transformer decoder.

3. Surprisingly, we find that the de facto usage of residual FeedForward operations is not efficient and can be removed entirely with minimal loss of performance, while significantly boosting training and inference speed.

4.2 Transformer Decoder


NMT models employ an encoder-decoder architecture to accomplish the translation process
in an end-to-end manner. The encoder transforms the source sentence into a sequence of
representations, and the decoder generates target words by dynamically attending to the source
representations. Typically, this framework can be implemented with a recurrent neural network
(RNN) [2], a convolutional neural network (CNN) [6], or a Transformer [7]. We focus on the
Transformer architecture since it has become the state-of-the-art model for machine translation
tasks, as well as various text understanding [25] and generation [33] tasks.
Specifically, the decoder is composed of a stack of N identical layers, each of which has three
sub-layers, as illustrated in Figure 4.1. A residual connection [34] is employed around each of
the three sub-layers, followed by layer normalization [35] (“Add & Norm”). The first sub-layer
is a self-attention module that performs self-attention over the previous decoder layer:

C_d^n = LN\big(ATT(Q_d^n, K_d^n, V_d^n) + L_d^{n-1}\big)

where ATT(·) and LN(·) denote the self-attention mechanism and layer normalization. Q_d^n, K_d^n, and V_d^n are the query, key, and value vectors transformed from the (n−1)-th layer representation L_d^{n−1}. The second sub-layer performs attention over the output of the encoder representation:

D_d^n = LN\big(ATT(C_d^n, K_e^N, V_e^N) + C_d^n\big)

where K_e^N and V_e^N are transformed from the top encoder representation L_e^N. The final sub-layer is a position-wise fully connected feed-forward network with ReLU activations:

L_d^n = LN\big(FFN(D_d^n) + D_d^n\big)

The top decoder representation L_d^N is then used to generate the final prediction.

4.3 Sub-Layer Partition


In this work, we aim to reveal how a Transformer decoder accomplishes the translation process
by utilizing both source and target inputs. To this end, we split each decoder layer into three
modules with respect to their different functionalities over the source or target inputs, as illustrated
in Figure 4.1:
• Target Exploitation Module (TEM) consists of the self-attention operation and a residual connection, which exploits the target-side translation history from previous layer representations.

• Source Exploitation Module (SEM) consists only of the encoder attention, which dynamically
selects relevant source-side information for generation.

• Information Fusion Module (IFM) consists of the rest of the operations, which fuse source and
target information into the final layer representation.
Compared with the standard splits [7], we associate the “Add&Norm” operation after encoder
attention with the IFM, since it starts the process of information fusion by a simple additive
operation. Consequently, the functionalities of the three modules are well-separated.
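For concreteness, the sketch below regroups a standard Transformer decoder layer into the three modules using PyTorch building blocks; the hyper-parameters and module names are illustrative, and details such as dropout and positional handling are omitted.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """A Transformer decoder layer regrouped into TEM / SEM / IFM."""

    def __init__(self, d_model=512, nhead=8, d_ffn=2048):
        super().__init__()
        # TEM: masked self-attention + residual + LayerNorm
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm_tem = nn.LayerNorm(d_model)
        # SEM: encoder (cross) attention only
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # IFM: Add&Norm after cross-attention, plus FFN + Add&Norm
        self.norm_ifm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))
        self.norm_ifm2 = nn.LayerNorm(d_model)

    def forward(self, prev_layer, enc_out, tgt_mask):
        # TEM: exploit the target-side translation history
        c, _ = self.self_attn(prev_layer, prev_layer, prev_layer, attn_mask=tgt_mask)
        c = self.norm_tem(c + prev_layer)
        # SEM: exploit source-side information from the encoder
        d, _ = self.cross_attn(c, enc_out, enc_out)
        # IFM: fuse source and target information
        h = self.norm_ifm1(d + c)
        return self.norm_ifm2(self.ffn(h) + h)
```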
Figure 4.1: A sub-layer splitting of the Transformer decoder with respect to their functionalities.

4.4 Research Questions


The modern Transformer decoder is implemented as a stack of identical layers, in which source
and target information are exploited and evolve layer by layer. One research question arises
naturally:

RQ1. How do source and target information evolve within the decoder layer-by-layer and
module-by-module?

In Section 4.5.1, we introduce a universal probing scheme to quantify the amount of information
embedded in decoder modules and explore their evolutionary trends. The general trend we find
is that higher layers contain more source and target information, while the sub-layers behave
differently. Specifically, the amount of information contained by SEMs first increases and
then decreases. In addition, we establish that SEM guides both source and target information
evolution within the decoder.
Since SEMs are critical to the decoder representation evolution, we conduct a more detailed
study into the internal behaviors of the SEMs. The exploitation of source information is also
closely related to the inadequate translation problem – a key weakness of NMT models [36]. We
try to answer the following research question:

RQ2. How does SEM exploit the source information in different layers?

In Section 4.5.2, we investigate how the SEMs transform the source information to the target side
in terms of alignment accuracy and coverage ratio [36]. Experimental results show that higher
layers of SEM modules accomplish word alignment, while lower layer ones exploit necessary
contexts. This also explains the representation evolution of source information: lower layers
collect more source information to obtain a global view of source input, and higher layers extract
less aligned source input for accurate translation.
Of the three sub-layers, IFM modules conceptually appear to play a key role in merging
source and target information – raising our final question:

RQ3. How does IFM fuse source and target information on the operation level?

In Section 4.5.3, we first conduct a fine-grained analysis of the IFM module on the operation
level, and find that a simple “Add&Norm” operation performs just as well at fusing information.
Thus, we simplify the IFM module to be only one Add&Norm operation. Surprisingly, this
performs similarly to the full model while significantly reducing the number of parameters and
consequently boosting both training and inference speed.

4.5 Experiments
Data To make our conclusions compelling, all experiments and analyses are conducted on three
representative language pairs. For English⇒German (En⇒De), we use the WMT14 dataset that
consists of 4.5M sentence pairs. The English⇒Chinese (En⇒Zh) task is conducted on the
WMT17 corpus, consisting of 20.6M sentence pairs. For English⇒French (En⇒Fr) task, we use
the WMT14 dataset that comprises 35.5M sentence pairs. English and French have many aspects
in common while English and German differ in word order, requiring a significant amount of
reordering in translation. Besides, Chinese belongs to a different language family compared to
the others.

Models We conducted the experiments on the state-of-the-art Transformer [7], and implemented
our approach with the open-source toolkit FairSeq [37]. We follow the setting of
Transformer-Base in [7], which consists of 6 stacked encoder/decoder layers with the model size
being 512. We train our models on 8 NVIDIA P40 GPUs, where each is allocated with a batch
size of 4,096 tokens. We use Adam optimizer [38] with 4,000 warm-up steps.

Training and Evaluation All Transformer models are selected based on their loss on the validation
set, and evaluated and reported on the test set. For the En-De and En-Fr models, we used
newstest2013 as the validation set and newstest2014 as the test set. For the En-Zh models, we used
newsdev2016 as the validation set and newstest2017 as the test set.
All three datasets follow the preprocessing steps from FairSeq³, which use the Moses tokenizer⁴
with a joint BPE of 40,000 merge operations, and do not include lower-casing or true-casing.
All models are evaluated with a beam size of 10. Before evaluating the BLEU score, we apply
a postprocessing step, where En-De and En-Fr generations apply compound word splitting⁵, and
En-Zh generations apply Chinese word splitting (into Chinese characters). All generations are
then evaluated with the Moses multi-bleu.perl script⁶ against the golden references.

4.5.1 Representation Evolution Across Layers


In order to quantify and visualize the representation evolution, we design a universal probing
scheme to quantify the source (or target) information stored in network representations.
³ https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-wmt14en2de.sh
⁴ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/mosestokenizer/tokenizer.py
⁵ https://gist.github.com/myleott/da0ea3ce8ee7582b034b9711698d5c16
⁶ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
Figure 4.2: Illustration of the information probing model, which reads the representation of a decoder module ("Input 1") and the word sequence to recover ("Input 2"), and outputs the generation probability ("Output").

Task Description Intuitively, the more source (or target) information is stored in a network
representation, the more likely a trained reconstructor can recover the source (or target)
sequence. Since the lengths of the source sequence and the decoder representations are not necessarily
the same, the widely-used classification-based probing approaches [39, 40] cannot be applied to
this task. Accordingly, we cast this task as a generation problem – evaluating the likelihood of
generating the word sequence conditioned on the input representation.
Figure 4.2 illustrates the architecture of our probing scheme. Given a representation sequence
from the decoder H = {h_1, ..., h_M} and the source (or target) word sequence to be recovered
x = {x_1, ..., x_N}, the recovery likelihood is calculated as the perplexity (i.e. negative
log-likelihood) of force-decoding the word sequence:

PPL(x|H) = -\sum_{n=1}^{N} \log P(x_n \mid x_{<n}, H)    (4.1)

The lower the recovery perplexity, the more the source (or target) information stored in the
representation. The probing model can be implemented as any architecture. For simplicity, we
use a one-layer Transformer decoder. We train the probing model to recover both source and
target sequence from all decoder sub-layer representations. During training, we fix the NMT
model parameters and train the probing model on the MT training set to minimize the recovery
perplexity in Equation 4.1.
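A simplified sketch of this probing objective is shown below; `probe` stands for the one-layer Transformer-decoder probing model (with causal masking and input shifting handled internally) and is an assumed interface rather than the exact code used.

```python
import torch.nn.functional as F

def recovery_perplexity(probe, module_repr, word_seq):
    """Eq. 4.1: negative log-likelihood of force-decoding word_seq from a
    frozen decoder-module representation; lower values mean more of the
    probed information is recoverable."""
    logits = probe(module_repr, word_seq)       # (seq_len, vocab), causal masking inside
    log_probs = F.log_softmax(logits, dim=-1)
    gold = log_probs.gather(-1, word_seq.unsqueeze(-1)).squeeze(-1)
    return -gold.sum()                          # summed over the sequence
```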

Task Discussion The above probing scheme is a general framework applicable to probing any
given sequence from a network representation. When we probe for the source sequence, the
probing model is analogous to an auto-encoder [41], which reconstructs the original input from
the network representations. When we probe for the target sequence, we apply an attention mask
to the probing decoder to avoid direct copying from the input of translation histories. Contrary
to source probing, the target sequence is never seen by the model.
In addition, our proposed scheme can also be applied to probe linguistic properties that can
be represented in a sequential format. For instance, we could probe source constituency parsing
information, by training a probing model to recover the linearized parsing sequence [42].

Probing Results Figure 4.3 shows the results of our information probing conducted on the
held-out set. We have a few observations:

• The evolution trends of TEM and IFM are largely the same. Specifically, the curve of TEM
is very close to that of IFM shifted up by one layer. Since TEM representations are two
operations (self-attn. and Add&Norm) away from the previous layer IFM, this observation
indicates that TEMs do not significantly affect the amount of source/target information.⁷

• SEM guides both source and target information evolution. Closely observing the
curves, the trend of the layer representations (i.e. IFM) is always led by that of SEM. For
example, as the PPL of SEM transitions from decreasing to increasing, the PPL of IFM slows
its decrease and subsequently starts increasing as well. This is intuitive: in machine
translation, source and target sequences should contain equivalent information, so target
generation should largely follow the lead of the source information (from SEM representations)
to guarantee adequacy.

• For IFM, the amount of target information consistently increases in higher layers – a
consistent decrease of PPL in Figures 4.3(d-f). While source information goes up in the
lower layers, it drops in the highest layer (Figures 4.3(a-c)).

Since SEM representations are critical to decoder evolution, we turn to investigate how SEM
exploits source information, in the hope of explaining the decoder information evolution.
⁷ TEM may change the order or distribution of source/target information, which is not captured by our probing experiments.

4.5.2 Exploitation of Source Information


Ideally, SEM should accurately and fully incorporate the source information for the decoder.
Accordingly, we evaluate how well SEMs accomplish the expected functionality from two per-
spectives.

Word Alignment. Previous studies generally interpret the attention weights of SEM as word
alignments between source and target words, which can measure whether SEMs select the most
relevant part of source information for each target token [32, 36]. We follow the previous practice
to merge attention weights from the SEM attention heads and extract word alignments by selecting
the source word with the highest attention weight for each target word. We calculate the alignment
error rate (AER) scores [43] for word alignments extracted from SEM of each decoder layer.

Cumulative Coverage. Coverage is commonly used to evaluate whether the source words are
fully translated [36]. We use the above extracted word alignments to identify the set of source
words Ai , which are covered (i.e., aligned to at least one target word) at each layer. We then
propose a new metric cumulative coverage ratio C≤i to indicate how many source words are
covered by the layers ≤ i:
C≤i = |A1 ∪ · · · ∪ Ai | / N (4.2)
where N is the number of total source words. This metric indicates the completeness of source
information coverage till layer i.
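A small sketch of Equation 4.2, assuming the per-layer sets of covered source-word indices are available as Python sets:

    def cumulative_coverage(per_layer_aligned, num_src_words, i):
        covered = set().union(*per_layer_aligned[:i])   # A_1 ∪ · · · ∪ A_i
        return len(covered) / num_src_words             # C_{<=i}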

Dataset We conducted experiments on two manually-labeled alignment datasets: RWTH En-De8 and En-Zh [44]. The alignments are extracted from NMT models trained on the WMT En-De and En-Zh datasets.

Results Figure 4.4 demonstrates our results on word alignment and cumulative coverage. We
find that the lower-layer SEMs focus on gathering source contexts (a rapid increase of cumulative
coverage with poor word alignment), while higher-layer ones play the role of word alignment
with the lowest AER score of less than 0.4 at the 5th layer. The 4th layer and the 3rd layer
separate the two roles for En-De and En-Zh respectively. Correspondingly, they are also the turning points (where the PPL turns from decreasing to increasing) of the source information evolution in Figure 4.3(a,b). Together with the conclusions from Sec. 4.5.1, we demonstrate the general pattern of SEM: SEM tends to cover more source content and gain an increasing amount of source information up to a turning point at the 3rd or 4th layer, after which it attends only to the most relevant source tokens and contains a decreasing amount of total source information.

8 https://www-i6.informatik.rwth-aachen.de/goldAlignment

Decoder        En-De  En-Zh  En-Fr

TEM⇒SEM⇒IFM    27.45  32.24  40.39
SEM⇒TEM⇒IFM    27.61  33.62  40.89
SEM⇒IFM        22.76  30.06  37.56

Table 4.1: Effects of the stacking order of decoder sublayers on translation quality in terms of
BLEU score.

TEM Modules Since TEM representations serve as the query vector for encoder attention op-
erations (shown in Figure 4.1), we naturally hypothesize that TEM is helping SEM in building
alignments. To verify that, we remove TEM from the decoder (“SEM⇒IFM”), which signif-
icantly increases the alignment error from 0.37 to 0.54 (in Figure 4.5), and leads to a serious
decrease in translation performance (BLEU: 27.45 ⇒ 22.76, in Table 4.1) on En-De, while
results on En-Zh also confirm this (in Figure 4.6). This indicates that TEM is essential for building
word alignment.
However, reordering the stacking of TEM and SEM (“SEM⇒TEM⇒IFM”) does not affect
the alignment or translation qualities (BLEU: 27.45 vs. 27.61). These results provide empirical
support for recent work on merging TEM and SEM modules [45].

Depth En-De En-Zh En-Fr Ave.


6 27.45 32.24 40.39 33.36
4 27.52 31.35 40.37 33.08
12 27.64 32.50 40.44 33.53

Table 4.2: Effects of various decoder depths on translation quality in terms of BLEU score.

Robustness to Decoder Depth To verify the robustness of our conclusions, we vary the depth
of the NMT decoder and train it from scratch. Table 4.2 demonstrates the results on translation
quality, which generally show that more decoder layers bring better performance. Figure 4.7
shows that SEM behaves similarly regardless of depth. These results demonstrate the robustness
of our conclusions.

4.5.3 Information Fusion in Decoder


We now turn to the analysis of IFM. Within the Transformer decoder, IFM plays a critical role
in fusing the source and target information by merging representations from SEM and TEM. To
study the information fusion process, we conduct a more fine-grained analysis of IFM at the
operation level.

Fine-Grained Analysis on IFM As shown in Figure 4.8(a), IFM contains three operations:

• Add-NormI linearly sums and normalizes the representations from SEM and TEM;

• Feed-Forward non-linearly transforms the fused source and target representations;

• Add-NormII again linearly sums and normalizes the representations from the above two.

IFM Analysis Results Figures 4.8 (b) and (c) respectively illustrate the source and target
information evolution within IFM. Surprisingly, Add-NormI contains a similar amount of, if not
more, source (and target) information than Add-NormII , while the Feed-Forward curve deviates
significantly from both. This indicates that the residual Feed-Forward operation may not affect
the source (and target) information evolution, and one Add&Norm operation may be sufficient
for information fusion.
Model Self-Attn. Enc-Attn. FFN
Base 6.3M 6.3M 12.6M
Big 25.2M 25.2M 50.4M

Table 4.3: The number of parameters taken by three major operations within Transformer Base
and Big decoder.

Simplified Decoder To empirically demonstrate whether one Add&Norm operation is already sufficient, we remove all other operations, leaving just one Add&Norm operation for the IFM.
The architectural change is illustrated in Figure 4.9(b), and we dub it the “simplified decoder”.
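The sketch below illustrates one possible form of such a simplified layer, under the assumption that TEM keeps its own Add&Norm and the IFM is reduced to a single Add&Norm over the TEM and SEM outputs; dropout placement, layer order details, and initialization of the actual Fairseq implementation are omitted:

    import torch.nn as nn

    class SimplifiedDecoderLayer(nn.Module):
        def __init__(self, d_model: int = 512, nhead: int = 8, dropout: float = 0.1):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)  # TEM attention
            self.norm_tem = nn.LayerNorm(d_model)
            self.enc_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)   # SEM attention
            self.norm_ifm = nn.LayerNorm(d_model)  # the single remaining Add&Norm; the FFN block is removed

        def forward(self, x, memory, tgt_mask=None):
            tem = self.norm_tem(x + self.self_attn(x, x, x, attn_mask=tgt_mask)[0])
            sem = self.enc_attn(tem, memory, memory)[0]
            return self.norm_ifm(tem + sem)  # IFM: fuse source and target information with one Add&Norm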

        Decoder     BLEU   #Train.  #Infer.

En-De   Standard    27.45  63.93K   65.35
        Simplified  27.29  71.08K   72.93
        △           -0.16  +11.18%  +11.60%
En-Zh   Standard    32.24  32.49K   38.55
        Simplified  33.15  36.59K   54.06
        △           +0.91  +12.62%  +40.23%
En-Fr   Standard    40.39  68.28K   58.97
        Simplified  40.07  76.03K   67.23
        △           -0.32  +11.35%  +14.01%

Table 4.4: Performance of the simplified Base decoder. “#Train” denotes the training speed
(words per second) and “#Infer.” denotes the inference speed (sentences per second). Results are
averages of three runs.

Model Fluency Adequacy


Standard (Base) 4.00 3.86
Simplified (Base) 4.01 3.87

Table 4.5: Human evaluation of translation performance of both standard and simplified decoders
on 100 samples from the En-Zh test set, on a scale of 1 to 5.

Simplified Decoder Results Table 4.4 reports the translation performance of both architectures
on all three major datasets, while Figure 4.10 illustrates the information evolution of both on
WMT En-De. We find the simplified model reaches comparable performance with only a minimal
drop of 0.1-0.3 BLEU on En-De and En-Fr, while gaining 0.9 BLEU on En-Zh.9 To
further assess the translation performance, we manually evaluate 100 translations sampled from
the En-Zh test set. On a scale of 1 to 5, we find that the simplified decoder obtains a fluency score
of 4.01 and an adequacy score of 3.87, which is approximately equivalent to that of the standard
decoder, i.e. 4.00 for fluency and 3.86 for adequacy (in Table 4.5).
On the other hand, since the simplified decoder drops the operations (FeedForward) with
most parameters (shown in Table 4.3), we also expect a significant increase in both training and
inference speeds. From Table 4.4, we confirm a consistent boost in both training and inference speeds of approximately 11-14%.
8 As a comparison, the total number of parameters in Base and Big models are 62.9M and 213.9M respectively on En-De.
9 Simplified models are trained with the same hyper-parameters as standard ones, which may be suboptimal as the number of parameters is significantly reduced.

To demonstrate the robustness, we also confirm our findings
with the Transformer Big setting [7], whose results are shown in Section 4.6. The lower PPL in
Figure 4.10 suggests that the simplified model also contains consistently more source and target
information across its stacking layers.
Our results demonstrate that a single Add&Norm is indeed sufficient for IFM, and the sim-
plified model reaches comparable performance with a significant parameter reduction and a
noticeable 11-14% boost in training and inference speed.

4.6 Transformer Big Results


We also compare the performance of the standard and simplified decoder with the Transformer
Big setting. Big models are trained on 4 NVIDIA V100 GPUs, each allocated a batch size of 8,192 tokens. Other training schedules and hyper-parameters are the same as
standard [7]. Also, our Transformer Base models are all trained with full precision (FP32), while
Big models are all trained with half-precision (FP16) for faster training.
Transformer Big results are shown in Table 4.6. We observe a more severe BLEU score drop alongside a more significant speed boost with the Big setting. This is intuitive: compared to the Base setting, the simplified decoder drops more parameters while still being trained with the same schedule as the standard model, thus escalating the training discrepancy. Unfortunately, due to resource limitations, we could not afford hyper-parameter tuning for the Transformer Big models.

        Decoder     BLEU   #Train.  #Infer.

En-De   Standard    28.66  103.7K   74.3
        Simplified  28.20  125.2K   90.5
        △           -0.46  +20.7%   +21.8%
En-Zh   Standard    34.48  71.3K    30.5
        Simplified  34.35  82.6K    46.0
        △           -0.13  +15.8%   +50.8%
En-Fr   Standard    42.48  113.8K   65.7
        Simplified  42.19  138.1K   80.9
        △           -0.29  +21.4%   +23.1%

Table 4.6: Performance of the simplified Big decoder. “#Train” denotes the training speed (words
per second) and “#Infer.” denotes the inference speed (sentences per second).

4.7 Related Work


Interpreting Encoder Representations Previous studies generally focus on interpreting the
encoder representations by evaluating how informative they are for various linguistic tasks [40,
46], for both RNN models [39, 47–49] and Transformer models [26–28, 50]. Although they
found that a certain amount of linguistic information is captured by encoder representations, it is
still unclear how much encoded information is used by the decoder. Our work bridges this gap
by interpreting how the Transformer decoder exploits the encoded information.

Interpreting Encoder Self-Attention In recent years, there has been a growing interest in inter-
preting the behaviors of attention modules. Previous studies generally focus on the self-attention
in the encoder, which is implemented as multi-head attention. For example, [29] showed that
different attention heads in the encoder-side self-attention generally attend to the same position.
[30] and [31] found that only a few attention heads play consistent and often linguistically in-
terpretable roles, and others can be pruned. In this work, we investigated the functionalities of
decoder-side attention modules for exploiting both source and target information.

Interpreting Encoder Attention The encoder-attention weights are generally employed to interpret the output predictions of NMT models. Recently, [51] showed that attention weights
are weakly correlated with the contribution of source words to the prediction. Related to our
work, [32] also conducted word alignment analysis on the same De-En and Zh-En datasets with
Transformer models10 . We use similar techniques to examine word alignment in our context; how-
ever, we also introduce a forced-decoding-based probing task to closely examine the information
flow.

Understanding and Improving NMT Recent work started to improve NMT based on the find-
ings of interpretation. For instance, [39, 52] pointed out that different layers prioritize different
linguistic types. [53] explained why the decoder learns considerably less morphology than the
encoder, and then explored to explicitly inject morphology in the decoder. [54] argued that the
need to represent and propagate lexical features in each layer limits the model’s capacity, and in-
troduced gated shortcut connections between the embedding layer and each subsequent layer. In this work, based on our information probing analysis, we simplified the decoder by removing the residual feedforward module in its entirety, with minimal loss of translation quality and a significant boost in both training and inference speeds.

10 We find our results are more similar to that of [32]. Also, our results are reported on the En⇒De and En⇒Zh directions, while they report results in the inverse directions.
Figure 4.3: Evolution trends of source (upper panels a-c) and target (bottom panels d-f) information embedded in the decoder modular representations (IFM, SEM, TEM) across decoder layers on En-De, En-Zh, and En-Fr. Lower perplexity ("PPL") denotes more information embedded in the representations.

Figure 4.4: Behavior of the SEM in terms of (a) alignment quality measured in AER (the lower, the better), and (b) the cumulative coverage of source words, across decoder layers on En-De and En-Zh.

Figure 4.5: Effects of the stacking order of TEM and SEM on the En-De dataset, in terms of (a) alignment (AER) and (b) cumulative coverage across decoder layers.
Figure 4.6: Effects of the stacking order of TEM and SEM on the En-Zh dataset, in terms of (a) alignment (AER) and (b) cumulative coverage across decoder layers.

Figure 4.7: Effects of decoder depths (4, 6, and 12 layers) on SEM behaviors on the En-De task, in terms of (a) word alignment (AER) and (b) cumulative coverage.

Figure 4.8: Illustration of (a) the three operations within IFM, and (b,c) the source and target information evolution within IFM on the En-De task.

Figure 4.9: Illustration of (a) the standard decoder and (b) the simplified decoder.

Figure 4.10: Comparison of IFM information evolution between the standard and simplified decoder on En-De, in terms of (a) source PPL and (b) target PPL across decoder layers.

Chapter 5: Explaining and Improving Multilingual Off-Target Translation

5.1 Motivation
With Neural Machine Translation becoming the state-of-the-art approach in bilingual machine
translations [2, 7], Multilingual NMT systems have increasingly gained attention due to their
deployment efficiency. One conceptually attractive advantage of Multilingual NMT [5] is its
capability to translate between multiple source and target languages with only one model, where
many directions1 are trained in a zero-shot manner.
Despite its theoretical benefits, Multilingual NMT often suffers from target language in-
terference [5, 55]. Specifically, [5] found that Multilingual NMT often improves performance
compared to bilingual models in the many-to-one setting (translating other languages into English), yet often hurts performance in the one-to-many setting (translating English into other languages).
Several other works [8, 56, 57] also confirm one-to-many translation to be more challenging than
many-to-one. Another widely observed phenomenon is that the current multilingual system on
zero-shot translations faces a serious off-target translation issue [58, 59] where the generated
target text is not in the intended language. For example, Table 5.1 shows the percentage of
off-target translations appearing between high-resource languages. These issues exemplify the
internal failure of multilingual systems to model different target languages. This chapter focuses
on reducing off-target translation, since it has the potential to improve the quality of zero-shot
translation as well as general translation accuracy.

WMT Fr-De De-Fr Cs-De De-Cs


Baseline 51.60% 39.80% 13.10% 20.50%
OPUS Fr-De De-Fr Fr-Ru Ru-Fr
Baseline 95.15% 93.70% 68.85% 91.20%

Table 5.1: Off-target translation percentages on WMT and OPUS test set.

Previous work on reducing the off-target issue often resorts to back-translation (BT) tech-
niques [60]. [58] employs a pretrained NMT model to generate BT parallel data for all O(N²) English-free² directions and trains the multilingual systems on both real and synthetic data.

1 Most NMT datasets are English-centric, meaning that all training pairs include English either as source or target.

[59]
instead fine-tune the pretrained multilingual system on BT data that are randomly generated on-
line for all zero-shot directions. However, leveraging BT data for zero-shot directions has some
weaknesses:
• The need for BT data grows quadratically with the number of languages involved, requiring
significant time and computing resources to generate the synthetic data.

• Training the multilingual systems on noisy BT data would usually hurt the English-centric
performance [59].
In this work, we propose a joint representation-level and gradient-level regularization to directly
address the multilingual system’s limitation of modeling different target languages. At the rep-
resentation level, we regulate the NMT decoder states by adding an auxiliary Target Language
Prediction (TLP) loss, such that decoder outputs are retained with target language information.
At the gradient level, we leverage a small amount of direct data (in thousands of sentence pairs)
to project the model gradients for each target language (TGP for Target-Gradient-Projection).
We evaluate our methods on two large-scale datasets, one concatenated from previous WMT
competitions with 10 languages, and the OPUS-100 from [59] with 95 languages. Our results
demonstrate the effectiveness of our approaches in all language pairs, with an average of +5.59
and +10.38 BLEU gains across zero-shot pairs, and reductions in off-target rates from 24.5% to 0.9% on WMT-10 and from 65.8% to 4.7% on OPUS-100 respectively. Moreover, we show that off-target
translation not only appears in the zero-shot directions, but also exists in the English-centric pairs.

5.2 Approach
In this section, we will illustrate the baseline multilingual models and our proposed joint repre-
sentation and gradient regularizations.

5.2.1 Representation-Level Regularization: Target Language Prediction (TLP)

As shown in Table 5.1, the current multilingual baseline faces a serious off-target translation
issue across the zero-shot directions. With the multilingual decoder generating tokens in a wrong language, its decoder states for different target languages are also mixed and not well separated, in spite of the input token ⟨j⟩. We thus introduce a representation-level regularization by adding an auxiliary Target Language Prediction (TLP) task to the standard NMT training.

2 We denote translation directions that do not involve English as English-free directions.
Specifically, given the source sentence x = (x1 , x2 , ..., x|x| ) and a desired target language
⟨j⟩, the model generates a sequence of decoder states z = (z1 , z2 , ..., z|ŷ| )3 . As the system feeds
z through a classifier and predicts tokens (in Equation 2.6), we feed z through a LangID model
to classify the desired target language ⟨j⟩. TLP is then optimized with the cross-entropy loss:

LTLP = − log Mθ (z, ⟨j⟩), (5.1)

where the LangID model Mθ is parameterized as a 2-layer Transformer encoder with a LangID
classifier on the top. TLP loss is then linearly combined with LNMT with a coefficient α. Empiri-
cally, we found that any value from {0.1, 0.2, 0.3} for α performs similarly well.

L = (1 − α) · LNMT + α · LTLP . (5.2)

Implementation We implement the LangID model as a 2-layer Transformer encoder with input from the multilingual decoder states to classify the target language. We add to the decoder
states a sinusoidal positional embedding for position information. We implement two common
approaches to do classification: CLS_Token and Meanpooling. For CLS_Token, we employ a
BERT-like [25] CLS token and feed its topmost states to the classifier. For Meanpooling, we
simply take the mean of all output states and feed it to the classifier. Their comparison is shown
in Section 5.4.1.
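A minimal sketch of the Meanpooling variant and the joint loss of Equations 5.1 and 5.2; names are illustrative, and the sinusoidal positional embedding added to the decoder states is omitted for brevity:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LangIDProbe(nn.Module):
        def __init__(self, d_model: int, num_langs: int, nhead: int = 8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model, nhead)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # 2-layer LangID model
            self.classifier = nn.Linear(d_model, num_langs)

        def forward(self, decoder_states):          # [T, B, d_model]
            h = self.encoder(decoder_states)
            pooled = h.mean(dim=0)                   # Meanpooling over time
            return self.classifier(pooled)           # logits over target languages

    def joint_loss(nmt_loss, langid_logits, tgt_lang_ids, alpha=0.3):
        tlp_loss = F.cross_entropy(langid_logits, tgt_lang_ids)   # Eq. 5.1
        return (1 - alpha) * nmt_loss + alpha * tlp_loss          # Eq. 5.2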

5.2.2 Gradient-Level Regularization: Target-Gradient-Projection (TGP)


Although the TLP loss helps build more separable decoder states, it lacks reference signals to
directly guide the system on how to model different target languages. Inspired by recent gradient-
alignment-based methods [61–64], we propose Target-Gradient-Projection (TGP) to guide the
model training with constructed oracle data, where we project the training gradient to not conflict
with the oracle gradient.
3 z is of the same length as the system translation ŷ.

Algorithm 1: Target-Gradient-Projection
Input: Involved language set L; pre-trained model θ; training data Dtrain; oracle data Doracle; update frequency n
 1  Initialize step t = 0, θ0 = θ
 2  while not converged do
        ▷ Update oracle data gradients
 3      if t (mod n) = 0 then
 4          for i in L do
 5              g_oracle^i = Σ_{B^i ∼ D_oracle^i} ∇θt L(θt, B^i)
 6          end
 7      end
 8      Sample minibatches grouped by target language B = {T^i}
 9      for i in L do
10          g_train^i = ∇θt L(θt, T^i)
11          if g_oracle^i · g_train^i < 0 then
12              g_train^i = g_train^i − (g_train^i · g_oracle^i / ∥g_oracle^i∥²) g_oracle^i
13          end
14      end
15      Update t ← t + 1
16      Update θt with gradient Σ_i g_train^i
17  end

Creation of oracle data Similar to [61, 64], we build the oracle data from a multilingual dev
set, since the dev set is often available and is of a higher quality than the training set. More
importantly, for some zero-shot pairs, we are able to include hundreds or thousands of parallel
samples from the dev set. We construct the oracle data by concatenating all available dev sets
and grouping them by the target language. For example, the oracle data for French would include
every other language to French. The detailed construction of oracle data is specific to each dataset
and described in Section 5.3.4. The dev set often serves to select the best checkpoint for training,
thus we split the dev set into 80% for oracle data and 20% for checkpoint selection.

Implementation Contrary to standard multilingual training, where a training batch consists of parallel data from different language pairs, we group the training data by the target language after
the temperature-based sampling. By doing so, we treat the multilingual system as a multi-task
learner, and translations into different languages are regarded as different tasks. Similarly, we
construct the oracle data individually for each target language, whose gradients would serve as
guidance to the training gradients.
For each step, we obtain the training gradient g_train^i for target language i, and the gradient g_oracle^i of the corresponding oracle data. Whenever we observe a conflict between g_train^i and g_oracle^i, which is defined as a negative cosine similarity, we project g_train^i onto the normal plane of g_oracle^i to de-conflict [62]:

g_train^i = g_train^i − (g_train^i · g_oracle^i / ∥g_oracle^i∥²) · g_oracle^i. (5.3)

The detailed algorithm is illustrated in Algorithm 1. We train the multilingual system with NMT
loss or NMT+TLP joint loss for 40k steps before starting the TGP training, since the gradients
of oracle data are not stable when trained from scratch. We set the update frequency n = 200
for all our experiments. Empirically, TGP is approximately 1.5x slower than the standard
NMT training.
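A minimal sketch of the model-wise projection step of Equation 5.3, assuming both gradients have already been flattened into single vectors:

    import torch

    def tgp_project(g_train: torch.Tensor, g_oracle: torch.Tensor) -> torch.Tensor:
        dot = torch.dot(g_train, g_oracle)
        if dot < 0:  # conflict: negative cosine similarity with the oracle gradient
            g_train = g_train - dot / g_oracle.norm().pow(2) * g_oracle
        return g_train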

5.3 Experimental Setup

5.3.1 Datasets: WMT-10


Following [65], we collect parallel data from publicly available WMT campaigns4 to form an
English-centric multilingual WMT-10 dataset, including English and 10 other languages: French
(Fr), Czech (Cs), German (De), Finnish (Fi), Latvian (Lv), Estonian (Et), Romanian (Ro), Hindi
(Hi), Turkish (Tr) and Gujarati (Gu). The size of bilingual datasets ranges from 0.08M to 10M,
with five language pairs above 1M (Fr, Cs, De, Fi, Lv) and five language pairs below 1M (Et, Ro,
Hi, Tr, Gu). We use the same dev and test set as [65]. Since the WMT data does not include
zero-shot dev or test set, we have created 1k multi-way aligned dev and test sets for all involved
languages based on the WMT2019 test set. For evaluation, we picked 6 language pairs (12
translation directions) to examine zero-shot performance, including pairs of both high-resource
languages (Fr-De and De-Cs), and pairs of high- and low-resource languages (Ro-De and Et-Fr),
and pairs of both low-resource languages (Et-Ro and Gu-Tr). The detailed dataset statistics can
be found in Section 5.7.1.
4
http://www.statmt.org/wmt19/translation-task.html
45

Code #Test #Dev Overlap(%)


Tk 1852 1852 97.46%
Ig 1843 1843 96.31%
Li 2000 2000 87.75%
Yi 2000 2000 83.95%
Zu 2000 2000 83.45%

Table 5.2: The top-5 most overlapping dev and test sets on OPUS-100; the overlap rate is calculated as the percentage of the dev set that appears in the test set. The average overlap rate between dev and test sets is 15.26% across all language pairs.

5.3.2 Datasets: OPUS-100


To evaluate our approaches in the massive multilingual settings, we adopt the OPUS-100 corpus
from5 . OPUS-100 is also an English-centric dataset consisting of parallel data between English
and 100 other languages. We removed 5 languages (An, Dz, Hy, Mn, Yo) from OPUS since
they are not paired with a dev or test set. However, while constructing the oracle data from its
multilingual dev set, we found that the dev and test sets of OPUS-100 are noticeably noisy since
they are directly sampled from web-crawled OPUS collections6 . As shown in Table 5.2, several
dev sets have significant overlaps with their test sets. 15.26% of dev set samples appear in the test
set on average across all language pairs. This is a significant flaw of the OPUS-100 (v1.0) that
previous works have not noticed. To fix this, we rebuild the OPUS dataset as follows: without significantly modifying the dataset, we add an additional step of de-duplicating both the training and dev sets against the test set⁷, and move data from the training set to replenish the dev set after de-duplication. We additionally sampled a 2k zero-shot dev set using the OPUS sampling scripts⁸ to match the released 2k zero-shot test set. The detailed dataset statistics can be found in Section 5.7.1.⁹
5 https://opus.nlpl.eu/opus-100.php
6 https://opus.nlpl.eu/
7 We keep the OPUS-100 test set as is to make a fair comparison against previous works, although there are also noticeable duplicates within the test set.
8 https://github.com/EdinburghNLP/opus-100-corpus
9 Our rebuilt OPUS dataset is released at https://github.com/yilinyang7/fairseq_multi_fix.

5.3.3 Training and Evaluation


For both WMT-10 and OPUS-100, we tokenize the dataset with the SentencePiece model [66]
and form a shared vocabulary of 64k tokens. We employ the Transformer-Big setting [7] in all
our experiments on the open-sourced Fairseq Implementation10 [37]. The model is optimized
using Adam [38] with a learning rate of 5 × 10⁻⁴ and 4000 warm-up steps. The multilingual
model is trained on 8 V100 GPUs with a batch size of 4096 tokens and a gradient accumulation
of 16 steps, which effectively simulates the training on 128 V100 GPUs. Our baseline model is
trained with 50k steps, while it usually converges much earlier. For evaluation, we employ beam
search decoding with a beam size of 5 and a length penalty of 1.0. The BLEU score is measured
by the de-tokenized case-sensitive SacreBLEU11 [67].
In order to evaluate the off-target translations, we utilize an off-the-shelf LangID model from
FastText [68] to detect the languages of translation outputs.
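A sketch of this off-target evaluation, assuming the released lid.176.bin FastText model and plain-text system outputs; the exact post-processing may differ:

    import fasttext

    lid = fasttext.load_model("lid.176.bin")

    def off_target_rate(hypotheses, target_lang):   # e.g. target_lang = "de"
        expected = f"__label__{target_lang}"
        predictions = [lid.predict(h.replace("\n", " "))[0][0] for h in hypotheses]
        return sum(p != expected for p in predictions) / len(hypotheses)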

5.3.4 Construction of Oracle Data


On WMT-10, we use our human-labeled multi-way dev set together with the original English-
centric WMT dev set to construct the oracle data. On OPUS-100, we similarly combine the
zero-shot dev set with the original OPUS dev set for oracle data. On OPUS, we further merge
oracle data that consists of only English-centric dev sets, since it empirically obtains similar
performance while exhibiting noticeable speedups. The statistics of the constructed oracle data
are shown in Section 5.7.3.

5.4 Results
In this section, we will demonstrate the effectiveness of our approach on both WMT-10 and
OPUS-100 datasets. The full results are documented separately in Tables 5.5, 5.6, and 5.7 for
WMT-10, and Tables 5.8 and 5.9 for OPUS-100.
10 https://github.com/pytorch/fairseq
11 BLEU+case.mixed+lang.src-tgt+numrefs.1+smooth.exp+tok.13a+version.1.4.14

TLP α Avg. BLEU


Baseline 0 23.58
0.1 23.85
Meanpooling 0.2 23.73
0.3 23.93
0.1 23.81
CLS_Token 0.2 23.76
0.3 Diverged

Table 5.3: Comparing TLP approaches on WMT-10. BLEU is averaged across all English-centric
directions.

TGP Avg. BLEU


Baseline 23.58
Model-wise 24.35
Layer-wise 23.77
Matrix-wise 23.90

Table 5.4: Comparing TGP granularity on WMT-10. BLEU is averaged across all English-centric
directions.

5.4.1 TLP Results


Hyperparameter Tuning Table 5.3 shows the comparison between TLP implementations on
the WMT-10 dev set. We observe that the Meanpooling approach for TLP is more stable and delivers slightly better performance. In all the following experiments, we use the Meanpooling
approach for TLP, with α = 0.3 on WMT-10 and α = 0.1 on OPUS-100.

Performance From Tables 5.5 and 5.6 (row 4 vs. row 2), we can see that TLP outperforms
baselines in most En-X and X-En directions and all English-free directions as shown in Table 5.7
(row 4 vs. row 2). On average, TLP gains +0.4 BLEU on En-X, +0.28 BLEU on X-En, and
+2.12 BLEU on English-free directions. TLP also significantly reduces the off-target rate from
24.5% down to 6.0% (in Table 5.7). Meanwhile on OPUS-100, TLP performs similarly in
English-centric directions (in Table 5.8) while yielding +0.77 BLEU improvement in English-
free directions, together with a 65.8% → 60.5% drop in off-target occurrences (in Table 5.9).
These results demonstrate that by adding an auxiliary TLP loss, multilingual models retain target-language information much better and moderately improve on English-free pairs.

En → X TLP TGP Fr Cs De Fi Lv Et Ro Hi Tr Gu Avg Off-Tgts


1 Bilingual - - 36.3 22.3 40.2 15.2 16.5 15.0 23.0 12.2 13.3 7.9 20.19 1.00%
2 Baseline - - 32.7 19.8 38.4 14.2 17.4 19.9 25.5 13.7 16.3 11.6 20.95 1.06%
3 + Finetune - - 21.2 13.0 24.4 10.4 12.1 14.6 23.4 11.6 10.5 17.4 15.86 0.86%
4 ✓ - 33.2 20.1 38.9 14.7 17.7 20.1 26.1 13.5 16.5 12.7 21.35 1.09%
5 - ✓ 33.6 20.0 38.7 14.7 17.6 20.3 25.8 16.0 16.5 18.6 22.18 0.94%
6 Ours (Baseline + ___) ✓ ✓ 33.1 20.0 38.7 15.2 17.5 20.4 26.4 14.8 16.3 18.5 22.09 0.98%
7 - ✓⋆ 32.8 20.1 37.4 14.8 17.7 19.7 25.8 15.7 16.4 18.4 21.88 0.92%
8 ✓ ✓⋆ 33.0 20.2 37.8 15.1 17.7 20.2 26.3 14.6 16.5 19.4 22.08 0.97%

Table 5.5: BLEU scores of English → 10 languages translation on WMT-10. ✓⋆ denotes TGP
training in a zero-shot manner for all evaluated English-free pairs. “Off-Tgts” column reports
the average off-target rates from the FastText LangID model, while the off-target rate on the
references is 0.81%.

X → En TLP TGP Fr Cs De Fi Lv Et Ro Hi Tr Gu Avg Off-Tgts


1 Bilingual - - 36.2 28.5 40.2 19.2 17.5 19.7 29.8 14.1 15.1 9.3 22.96 0.30%
2 Baseline - - 34.0 28.2 39.1 19.9 19.5 24.8 34.6 21.9 22.4 17.8 26.22 0.23%
3 + Finetune - - 24.7 22.3 30.1 16.9 16.2 21.1 39.4 17.7 17.6 17.0 22.30 0.13%
4 ✓ - 35.0 28.7 39.5 20.4 20.2 25.5 34.6 21.4 22.3 17.4 26.50 0.20%
5 - ✓ 34.5 29.1 40.0 20.8 20.1 26.3 39.5 23.4 22.8 19.5 27.60 0.20%
6 Ours (Baseline + ___) ✓ ✓ 34.2 29.4 39.5 21.3 20.3 26.0 40.4 24.1 23.0 19.8 27.80 0.19%
7 - ✓⋆ 33.9 28.7 38.8 21.0 20.0 26.4 39.5 24.0 22.6 19.2 27.41 0.15%
8 ✓ ✓⋆ 34.4 28.8 39.6 21.3 20.5 26.8 40.6 24.2 22.9 20.8 27.99 0.20%

Table 5.6: BLEU scores of 10 languages → English translation on WMT-10. ✓⋆ denotes TGP
training in a zero-shot manner for all evaluated English-free pairs. “Off-Tgts” column reports
the average off-target rates from the FastText LangID model, while the off-target rate on the
references is 0.12%.


5.4.2 TGP Results


Settings Similar to [62, 63], the conflict detection and de-conflict projection of TGP training
could be done with different granularities. We compare three options: (1) model-wise: flatten all
parameters into one vector, and perform projection on the entire model; (2) layer-wise: perform
individually for each layer of encoder and decoder; (3) matrix-wise: perform individually for
each parameter matrix. From Table 5.4, we found operating on the model level gives the best
performance, and as a result all our TGP experiments are done on the model level. We then
perform TGP training for 10k steps on the 40k-step pretrained model.
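As an illustration of the model-wise granularity, the sketch below flattens all parameter gradients into a single vector before the dot-product test and projection; it is an assumed helper, not the exact implementation:

    import torch

    def flatten_gradients(model: torch.nn.Module) -> torch.Tensor:
        return torch.cat([p.grad.reshape(-1) for p in model.parameters() if p.grad is not None])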

En-Free  TLP  TGP   Fr-De (← →)   De-Cs (← →)   Ro-De (← →)   Et-Fr (← →)   Et-Ro (← →)   Gu-Tr (← →)   BLEU Avg   Off-Tgt Avg (%)
1 Pivoting - - 24.9 19.3 19.4 18.9 19.1 18.8 16.2 20.9 16.4 16.8 5.2 6.4 16.86 1.1%
2 Baseline - - 18.5 12.8 15.8 13.6 17.5 16.0 10.3 13.7 12.5 14.4 0.9 1.9 12.33 24.5%
3 + Finetune - - 17.9 15.1 14.6 13.1 18.7 14.6 12.3 16.4 12.6 17.0 8.1 7.0 13.95 0.7%
4 ✓ - 21.4 15.7 17.8 15.8 18.1 17.0 14.1 17.3 14.0 15.5 3.0 3.8 14.45 6.0%
5 - ✓ 25.4 19.4 20.0 18.7 22.7 19.6 16.5 22.4 17.0 20.2 6.2 6.7 17.90 0.9%
6 Ours (Baseline + _) ✓ ✓ 25.2 19.7 19.7 18.9 23.0 19.8 16.1 21.4 16.7 20.8 6.4 7.3 17.92 0.9%
7 - ✓⋆ 24.5 18.5 18.8 18.2 21.8 18.2 16.6 20.9 16.1 19.5 5.6 6.9 17.13 2.0%
8 ✓ ✓⋆ 24.2 18.3 20.0 18.3 22.3 19.5 15.9 21.5 16.1 20.1 5.6 7.1 17.41 2.1%

Table 5.7: BLEU scores of English-free translations on WMT-10. ✓⋆ denotes TGP training in a
zero-shot manner for all evaluated English-free pairs. As a reference, the average off-target rate
reported by the FastText LangID model is 0.68% on the references.

X → English English → X
English-Centric TLP TGP
High Med Low All High Med Low All
1 [59] (24L) - - 30.29 32.58 31.90 31.36 23.69 25.61 22.24 23.96
2 Baseline - - 30.27 33.50 31.94 31.61 23.64 29.13 29.38 26.56
3 + Finetune - - 19.85 29.93 36.49 26.57 15.45 23.84 30.05 21.21
4 ✓ - 30.31 33.17 33.06 31.78 23.71 29.11 29.24 26.55
5 - ✓ 30.17 38.21 42.23 35.26 23.51 30.61 33.60 27.88
6 Ours (Baseline + __) ✓ ✓ 29.96 38.32 41.94 35.13 23.35 30.23 33.63 27.69
7 - ✓⋆ 30.07 38.28 42.25 35.24 23.53 30.53 33.97 27.95
8 ✓ ✓⋆ 29.83 38.50 42.51 35.24 23.30 30.28 33.43 27.64

Table 5.8: Average test BLEU for High/Medium/Low-resource language pairs on OPUS-100
dataset. All denotes the average BLEU for all language pairs. ✓⋆ denotes TGP training in a
zero-shot manner for all evaluated English-free pairs.

Performance In Tables 5.5, 5.6 and 5.7 (row 5 vs. 2), TGP gains significant improvements in
all directions of WMT-10: averaging +1.23 BLEU on En-X, +1.38 BLEU on X-En, and +5.57
BLEU on English-free directions, while also reducing the off-target rates from 24.5% down
to only 0.9%. Similar gains could also be found on OPUS-100 (in Tables 5.8 and 5.9): +3.65
BLEU on En-X, +1.32 BLEU on X-En, and +10.63 BLEU on English-free, and a whopping
65.8% → 4.8% reduction to off-target occurrences. These results demonstrate the overwhelming
effectiveness of TGP on all translation tasks as well as in reducing off-target cases.

Learning curves Figures 5.1 and 5.2 illustrate the learning curves of TGP on WMT-10. Different from the baseline curve, TGP shows a slight increase in training loss, which indicates that TGP acts as a regularizer and prevents the model from overfitting to the training set.

En-Free  TLP  TGP   De-Fr (← →)   Ru-Fr (← →)   Nl-De (← →)   Zh-Ru (← →)   Zh-Ar (← →)   Nl-Ar (← →)   BLEU Avg   Off-Tgt Avg (%)
1 Pivoting - - 18.5 21.5 21.0 26.7 21.7 19.7 13.6 20.2 14.9 17.8 16.6 5.7 18.16 4.2%
2 Baseline - - 3.5 2.9 7.2 4.0 4.9 4.8 4.5 11.0 2.8 11.1 1.5 2.6 5.07 65.8%
3 + Finetune - - 13.1 14.9 15.9 17.3 16.0 14.4 12.8 14.8 15.2 12.9 8.5 3.1 13.24 3.7%
4 ✓ - 3.9 3.3 8.0 6.4 6.1 5.0 7.7 10.6 3.2 11.2 1.3 3.4 5.84 60.5%
5 - ✓ 16.1 18.2 18.6 21.1 19.1 18.5 13.9 16.2 15.0 14.4 12.3 5.0 15.70 4.8%
6 Ours (Baseline + __) ✓ ✓ 16.4 17.7 18.3 21.0 19.0 18.3 12.7 16.2 14.2 14.4 12.2 5.0 15.45 4.7%
7 - ✓⋆ 4.2 12.2 12.8 20.9 10.0 5.9 11.5 13.5 12.8 12.6 11.8 4.5 11.06 31.1%
8 ✓ ✓⋆ 6.6 14.2 16.7 21.4 16.2 8.6 12.9 14.2 14.7 12.6 11.8 4.6 12.88 16.9%

Table 5.9: BLEU scores of English-free translations on OPUS-100. ✓⋆ denotes TGP training in
a zero-shot manner for all evaluated English-free pairs. As a reference, the average off-target rate
reported by the FastText LangID model is 4.85% on the references.

Meanwhile, a steady
increase in both English-centric and English-free test BLEU demonstrates the effectiveness of
TGP training.

Finetuning on oracle data Suggested by [64], we explore another baseline usage of oracle data:
direct finetuning. For finetuning, we concatenate all oracle data from different target languages
together. With the same settings as TGP (finetuning for 10k steps on the 40k-step baseline), we
also observe a noticeable improvement in English-free directions: an average of +1.62 BLEU on
WMT-10 and +8.17 BLEU on OPUS-100, with the largest reduction in off-target occurrences (row 3 of Tables 5.7 and 5.9). However, in comparison to TGP, directly finetuning on oracle data lacks the separate modeling of different target languages and the crucial de-conflicting step; thus it hurts the English-centric (En-X and X-En) directions while also lagging by as much as -3.95 BLEU on English-free pairs (Table 5.7, row 3 vs. 5).

5.4.3 TGP in a Zero-Shot Setting


Although TGP and Finetuning obtain significant reductions on off-target cases, they both assume
some amount of direct parallel data on English-free pairs, while in reality, such direct parallel data
may not exist in the extreme low-resource scenarios. To simulate a zero-shot setting, we build
a new oracle dataset that explicitly excludes parallel data of all evaluated English-free pairs.12
12 We exclude the parallel data of the 6 evaluation pairs from oracle data while keeping the others. More details in Section 5.7.3.

Figure 5.1: The training loss curve of TGP on WMT-10, compared against the baseline.

In this setting, all evaluation pairs are trained in a strict zero-shot manner to test the system’s
generalization ability.

Performance In Tables 5.5, 5.6, 5.7 row 7 with ✓⋆ , TGP in a zero-shot manner slightly lags
behind TGP with full oracle data, while still gaining significant improvement compared to the
baseline. On average, we observe a gain of +0.93 BLEU on En-X, +1.19 BLEU on X-En, and
+4.8 on English-free compared to baseline (row 7 vs. 2), and a slight decrease of -0.3 BLEU on
En-X, -0.19 BLEU on X-En and -0.77 BLEU on English-free compared to TGP with full oracle
set (row 7 vs. 5). Meanwhile on OPUS-100 (in Tables 5.8,5.9), we also observe a consistent gain
against the baseline (row 7 vs. 2), but a noticeable -4.64 BLEU drop on English-free pairs against
TGP with full oracle data (row 7 vs. 5). The performance drop (zero-shot vs. full data) illustrates
that thousands of parallel samples13 could greatly help TGP on zero-shot translations, and we
suspect the drop of only -0.77 BLEU on WMT-10 is due to the multi-way nature of our WMT
oracle data. Meanwhile, TGP in a zero-shot setting is still shown to greatly improve translation
performance and significantly reduce off-target occurrences (24.5% → 2.0% on WMT and 65.8% → 31.1% on OPUS).

13 In our case, we obtain 1k for WMT and 2k for OPUS.

Figure 5.2: The test BLEU curves of TGP on WMT-10, compared against the baseline (top: English-centric BLEU; bottom: English-free BLEU).


5.4.4 Joint TLP+TGP


TLP models could be seamlessly adopted in TGP training, by replacing the original NMT loss
with a joint NMT+TLP loss. Comparing the joint TLP+TGP approach to TGP-only (row 6 vs. 5
in Tables 5.5-5.9), we observe no significant differences in the full oracle data scenario (changes
within ± 0.3 BLEU). However in the zero-shot setting, the joint TLP+TGP approach noticeably
outperforms TGP-only by +1.82 BLEU on average in English-free pairs (Table 5.9, row 8 vs. 7).
Given that TLP alone only gains +0.77 BLEU (row 4 vs. 2), this hints that TLP and TGP have a synergistic effect in the extremely low-resource scenario.

5.5 Discussions on Off-Target Translations


In this section, we will discuss the off-target translation in the English-centric directions and its
relationship with token-level off-targets.

5.5.1 Off-Targets on English-Centric Pairs


Previous literature only studies the off-target translations in the zero-shot directions. However, we
show in Tables 5.5 and 5.6 that off-target translation also occurs in the English-centric directions
(although on a smaller scale). Since we are using an imperfect LangID model, we quantify its
error margin as the off-target rates reported on the references14 . We could then observe that the
baseline model is producing 0.25% and 0.18% more off-target translations than the references in
En-X and X-En directions respectively, which are also reduced by our proposed TLP and TGP
approaches.

5.5.2 Token-Level Off-Targets


Given a sentence-level LangID model, we are also curious about how it represents errors at
the token level. We attempt to simply quantify the token-level off-target rates by checking
whether each token appears in the training data. Surprisingly, results in Table 5.10 show that
14 We assume the WMT references are always in-target, since they are collected from human translators.

Figure 5.3: Relationship between the random token-level probability p and the reported sentence-level off-target rates. Token-level off-targets are introduced by replacing the in-target token with a random off-target token with a probability p. Analysis done on the WMT-10 English-free references.

Token Off-Tgt En→X X→En En-Free


Reference 0.10% 0.02% 0.73%
Baseline 0.08% 0.02% 0.16%
+TLP 0.08% 0.02% 0.09%
+TGP 0.08% 0.02% 0.07%

Table 5.10: Token-level off-target rates on WMT-10 are quantified by whether appearing in the
training set.

all systems contain fewer token-level off-targets than the references. We attribute this to two main reasons: 1) the training set itself contains noisy off-target tokens; 2) there are domain/vocabulary mismatches between the training and test sets, especially for the English-free pairs.
In order to test the robustness of the FastText LangID model as well as to relate our reported
sentence-level scores to the token-level, we randomly introduced off-target tokens to the refer-
ences and observed the sentence-level scores. Specifically, we replaced the in-target token with a
random off-target one with a probability p. Figure 5.3 shows a near exponential curve between
the sentence-level scores and probability p. We could also observe that the sentence-level LangID
model is somewhat robust to token-level off-target noises, e.g. it reports around 4% off-target
rates given that 20% of the tokens are replaced with off-target ones.
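A sketch of this corruption procedure, where off_target_vocab is a hypothetical list of tokens drawn from other languages' training data:

    import random

    def corrupt_tokens(tokens, off_target_vocab, p):
        # with probability p, swap each in-target token for a random off-target one
        return [random.choice(off_target_vocab) if random.random() < p else tok for tok in tokens]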

Language #Oracle #Oracle∗


En 15976 15976
Fr 8776 7180
Cs 8776 7978
De 8776 6383
Fi 8776 8776
Lv 8776 8776
Et 8776 7180
Ro 8776 7180
Hi 8776 8776
Tr 8776 7978
Gu 8776 7978

Table 5.11: Statistics of the constructed oracle data on WMT-10. Oracle∗ denotes after excluding
the direct data of all evaluated language pairs. Languages are ranked by the available bilingual
resources.

Language #Oracle #Oracle∗


Ar 9600 6400
De 9600 6400
En 148427 148427
Fr 9600 6400
Gd 1284 1284
Ig 1474 1474
Kn 733 733
Nl 9600 6400
Or 1053 1053
Ru 9600 6400
Tk 1481 1481
Zh 9600 6400
Others 1600 1600

Table 5.12: Statistics of the constructed oracle data on OPUS-100. Oracle∗ denotes after ex-
cluding the direct data of all evaluated language pairs. Language ranked alphabetically. “Others”
includes all other 83 languages.

5.6 TGP Explanation and Visualization


Following the significant improvement from the TGP approach, we are curious about how TGP
works and specifically how the gradient projection affects the training dynamics. For explanation
and visualization purposes, we examine TGP training on a 2-D example under the following assumptions:

1. The training and dev sets are differently distributed.

2. The training and dev losses have a similar form but different local (or global) minima.

Without loss of generality, we set the training loss to Ltrain = (x − 1)² + (y − 1)² and the dev loss to Ldev = (x + 1)² + (y + 1)². Thus, both losses are quadratic but have different global minima: the training loss is minimized at (1, 1) and the dev loss at (−1, −1).
In Figure 5.4, we visualize the TGP projection on a simple 2-D case, where each point on the
plot represents a model state (i.e. parameter set). At any point, the training gradient is pointing
towards (1, 1), and the dev gradient is pointing towards (−1, −1). Thus, there exists a circle on
which the training gradient is always perpendicular to the dev gradient (i.e. their dot product

Figure 5.4: TGP projection visualization on a 2-D case, imagining a third axis pointing outwards
representing the train/dev loss curves. Any point on the plot represents a model state, where the
blue arrow represents the training gradient and the red arrow represents the dev gradient.

is zero). This circle in our case takes the form of x² + y² = 2. With a simple mathematical deduction, every point outside the circle has a positive dot product between the train and dev gradients, and every point inside the circle has a negative dot product. In terms of TGP, if
the model falls outside the circle, its training gradient will remain unchanged, yet if the model
falls on or inside the circle, its training gradient will be projected onto the normal plane of the
dev gradient.
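A toy numerical sketch of one TGP update in this 2-D case, under the stated quadratic losses:

    import numpy as np

    def tgp_step(theta, lr=0.1):
        g_train = 2 * (theta - np.array([1.0, 1.0]))    # gradient of the training loss
        g_dev = 2 * (theta - np.array([-1.0, -1.0]))    # gradient of the dev (oracle) loss
        if g_train @ g_dev < 0:                         # conflict: inside the circle x^2 + y^2 = 2
            g_train = g_train - (g_train @ g_dev) / (g_dev @ g_dev) * g_dev
        return theta - lr * g_train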
We then plot the vector field of TGP gradients in Figure 5.5, where the green and gray arrows represent unprojected and projected training gradients respectively, and their sizes illustrate the magnitude. We observe that outside the circle (i.e. the boundary between the grey and green fields), the TGP gradients always point towards the training global minimum and eventually enter the circle. Once the model parameters enter the circle, the TGP gradients follow the dev loss level sets (i.e. the normal plane of the dev gradients) to the middle ground, where the gradients slowly converge to zero. There is also a special line connecting the train and dev minima (plotted in Figure 5.4) where the training and dev gradients point in opposite directions and the TGP gradient is always zero.
The above 2-D case illustrates the TGP training dynamics: the model first updates towards the training local minimum; then, whenever the training and dev gradients conflict, it follows the dev loss level sets and converges to regions where the training and dev gradients point in opposite directions.
Given the assumption that the dev set is always cleaner than the training set, and that the dev minimum is always closer to the unknown test minimum, TGP is guaranteed to converge to a better local minimum than the training minimum. Yet, the training process in reality involves stochastic batched inputs, so it remains unclear how the stochastic batches affect the TGP training dynamics and whether the stochasticity improves generalization.

5.7 Dataset Statistics

5.7.1 WMT-10 Data


We concatenate all resources except WikiTitles from WMT2019. For Fr and Cs, we randomly
sample 10M sentence pairs from the full corpus. The detailed statistics of bitext data can be
found in Table 5.13. We randomly sample 1,000 sentence pairs from each individual validation
set and concatenate them to construct a multilingual validation set.

Figure 5.5: TGP gradient visualization on a 2-D case. The red and blue contours represent train and dev loss level sets. Arrows represent the TGP gradients, where green arrows represent unprojected gradients, and grey arrows represent projected ones.

Code Language #Train Dev Test


Fr French 10M Newstest13 Newstest15
Cs Czech 10M Newstest16 Newstest18
De German 4.6M Newstest16 Newstest18
Fi Finnish 4.8M Newstest16 Newstest18
Lv Latvian 1.4M Newsdev17 Newstest17
Et Estonian 0.7M Newsdev18 Newstest18
Ro Romanian 0.5M Newsdev16 Newstest16
Hi Hindi 0.26M Newsdev14 Newstest14
Tr Turkish 0.18M Newstest16 Newstest18
Gu Gujarati 0.08M Newsdev19 Newstest19

Table 5.13: Statistics of the WMT-10 dataset.

5.7.2 OPUS-100 Dataset


After de-duplicating both the training and dev sets against the test set, the statistics of the OPUS-100 dataset are shown in Table 5.14.

5.7.3 Statistics of the Oracle Data


We concatenated all available parallel dev sets individually for each target language. Tables 5.11
and 5.12 illustrate the statistics of our oracle data, before and after excluding the parallel English-
free data of 6 evaluation pairs.

Code Language Train Dev Test Code Language Train Dev Test
af Afrikaans 275451 2000 2000 lv Latvian 999976 2000 2000
am Amharic 88979 2000 2000 mg Malagasy 590759 2000 2000
ar Arabic 999988 2000 2000 mk Macedonian 999955 2000 2000
as Assamese 138277 2000 2000 ml Malayalam 822709 2000 2000
az Azerbaijani 262060 2000 2000 mr Marathi 26904 2000 2000
be Belarusian 67183 2000 2000 ms Malay 999973 2000 2000
bg Bulgarian 999983 2000 2000 mt Maltese 999941 2000 2000
bn Bengali 999990 2000 2000 my Burmese 23202 2000 2000
br Breton 153283 2000 2000 nb Norwegian Bokmål 142658 2000 2000
bs Bosnian 999991 2000 2000 ne Nepali 405723 2000 2000
ca Catalan 999983 2000 2000 nl Dutch 999984 2000 2000
cs Czech 999988 2000 2000 nn Norwegian Nynorsk 485547 2000 2000
cy Welsh 289007 2000 2000 no Norwegian 999977 2000 2000
da Danish 999991 2000 2000 oc Occitan 34886 2000 2000
de German 999993 2000 2000 or Oriya 14027 1317 1318
el Greek 999982 2000 2000 pa Panjabi 106351 2000 2000
eo Esperanto 337074 2000 2000 pl Polish 999991 2000 2000
es Spanish 999996 2000 2000 ps Pashto 77765 2000 2000
et Estonian 999993 2000 2000 pt Portuguese 999991 2000 2000
eu Basque 999991 2000 2000 ro Romanian 999986 2000 2000
fa Persian 999990 2000 2000 ru Russian 999982 2000 2000
fi Finnish 999993 2000 2000 rw Kinyarwanda 173028 2000 2000
fr French 999997 2000 2000 se Northern Sami 35207 2000 2000
fy Western Frisian 53381 2000 2000 sh Serbo-Croatian 267159 2000 2000
ga Irish 289339 2000 2000 si Sinhala 979052 2000 2000
gd Gaelic 15689 1605 1606 sk Slovak 999978 2000 2000
gl Galician 515318 2000 2000 sl Slovenian 999987 2000 2000
gu Gujarati 317723 2000 2000 sq Albanian 999971 2000 2000
ha Hausa 97980 2000 2000 sr Serbian 999988 2000 2000
he Hebrew 999973 2000 2000 sv Swedish 999972 2000 2000
hi Hindi 534192 2000 2000 ta Tamil 226495 2000 2000
hr Croatian 999991 2000 2000 te Telugu 63657 2000 2000
hu Hungarian 999982 2000 2000 tg Tajik 193744 2000 2000
id Indonesian 999976 2000 2000 th Thai 999972 2000 2000
ig Igbo 16640 1843 1843 tk Turkmen 11305 1852 1852
is Icelandic 999980 2000 2000 tr Turkish 999959 2000 2000
it Italian 999988 2000 2000 tt Tatar 100779 2000 2000
ja Japanese 999986 2000 2000 ug Uighur 72160 2000 2000
ka Georgian 377259 2000 2000 uk Ukrainian 999963 2000 2000
kk Kazakh 79132 2000 2000 ur Urdu 753753 2000 2000
km Central Khmer 110924 2000 2000 uz Uzbek 172885 2000 2000
kn Kannada 14303 917 918 vi Vietnamese 999960 2000 2000
ko Korean 999989 2000 2000 wa Walloon 103329 2000 2000
ku Kurdish 143823 2000 2000 xh Xhosa 439612 2000 2000
ky Kyrgyz 25991 2000 2000 yi Yiddish 13331 2000 2000
li Limburgan 23780 2000 2000 zh Chinese 1000000 2000 2000
lt Lithuanian 999973 2000 2000 zu Zulu 36947 2000 2000

Table 5.14: Statistics of the OPUS-100 dataset after de-duplicating against the test set.

Chapter 6: Explaining and Improving Multilingual Beam Search Decoding

6.1 Motivation
With Neural Machine Translation (NMT) [2, 7] becoming the state-of-the-art approach in the
bilingual Machine Translation literature, Multilingual Neural Machine Translation (MNMT) has
attracted much attention [5], mainly because: a) It enables one model to translate between multiple language pairs and thus reduces the model and deployment complexity from O(N²) to
O(1). b) It enables transfer learning between high-resource and low-resource languages. One
attractive advantage of such transfer learning is zero-shot translation, where the multilingual
model is able to translate between language pairs unseen during training. For example, after
training on French to English and English to German data, the model could perform French to
German translation.
Despite the theoretical benefits, recent studies have found an overwhelming amount of off-
target translation especially for the zero-shot directions [59, 69], where the translation is not
in the intended language. Existing methods all aim to mitigate off-targets during training. [58,
59] apply Back Translation (BT) to generate synthetic training data for the zero-shot pairs. [69]
introduces a language prediction loss and regularizes the training gradients with a held-out oracle
set. Yet, none of the previous work has investigated the off-target issue at decoding time, i.e. how
off-target translations emerge and come to outscore on-target translations during beam search
decoding.
In this work, we first examine how and when off-target translation emerges during beam
search decoding, and then propose Language-Informed Beam Search (LIBS), a general algorithm
that reduces off-target generation during beam search decoding by incorporating an off-the-shelf
Language Identification (LiD) model. Our experimental results on two popular large-scale MNMT
datasets (i.e. WMT and OPUS) demonstrate the effectiveness of LIBS in both reducing off-target
rates and improving general translation performance. Moreover, LIBS can be applied post-hoc to
any existing multilingual model to reduce off-target translation, without requiring any additional
data or training.

6.2 Experiment Setup


In this section, we describe the data and model setup used in our experiments with the
Language-Informed Beam Search algorithm.

6.2.1 Dataset
Following [65, 69], we conduct experiments on two widely used large-scale MNMT datasets,
WMT¹ and OPUS-100², where the WMT dataset is concatenated from previous years' WMT train-
ing data and covers English plus 10 other languages. Since the WMT competition does not provide
zero-shot evaluation data, we use our human-labeled, multi-way aligned test set from [69], which is
based on the WMT-19 test set.

6.2.2 Model Training and Evaluation


For both WMT and OPUS-100, we tokenize the data with a SentencePiece model [66]
to form a shared vocabulary of 64k tokens. We adopt the Transformer-Big setting [7] in our
experiments on the open-source Fairseq codebase³ [37]. The model is optimized with the
Adam optimizer [38] using a learning rate of 5 × 10⁻⁴, 4000 warm-up steps, and a total of 50k
training steps. The multilingual model is trained on 8 V100 GPUs with a batch size of 8192
tokens and gradient accumulation over 8 steps, which essentially simulates training on 64 V100
GPUs. To evaluate the baseline model, we employ beam search decoding with a beam size of 5
and a length penalty of 1.0. The BLEU score is then measured with de-tokenized, case-sensitive
SacreBLEU⁴ [67].
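To make the preprocessing step above concrete, the following is a minimal sketch (not the dissertation's exact script) of building and applying a shared 64k SentencePiece vocabulary; the file names and any options other than the vocabulary size are illustrative assumptions.

```python
# Hypothetical sketch: build and apply a shared 64k SentencePiece vocabulary
# over the concatenated multilingual training text (file names are illustrative).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.all.txt",      # concatenation of all source/target training text (assumed path)
    model_prefix="spm64k",
    vocab_size=64000,
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="spm64k.model")
pieces = sp.encode("Nous avons maintenant une excellente relation.", out_type=str)
print(pieces)  # subword tokens drawn from the shared multilingual vocabulary
```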
To evaluate the off-target rates, we borrow the off-the-shelf LiD model⁵ from FastText [68]
to detect the language of system translations. Similar to [69], we observe an overwhelming
off-target rate (averaging 22.9%) across zero-shot pairs for our strong baseline model.
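For concreteness, a minimal sketch of how the off-target rate might be computed with this LiD model is shown below; the function name, file path, and expected-language code are illustrative assumptions, not the exact evaluation script.

```python
# Hypothetical sketch: off-target rate of system translations via the FastText LiD model.
import fasttext

lid = fasttext.load_model("lid.176.bin")  # off-the-shelf FastText LiD model

def off_target_rate(hypotheses, expected_lang="de"):
    """Fraction of system translations not identified as the expected target language."""
    off = 0
    for hyp in hypotheses:
        labels, _ = lid.predict(hyp.replace("\n", " "))
        pred_lang = labels[0].replace("__label__", "")
        off += int(pred_lang != expected_lang)
    return off / max(len(hypotheses), 1)
```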
¹ Referred to as “WMT-10” in [65, 69]; we denote it as WMT to disambiguate it from the WMT 2010 campaign.
² We use the de-duplicated version from [69].
³ https://github.com/pytorch/fairseq
⁴ BLEU+case.mixed+lang.src-tgt+numrefs.1+smooth.exp+tok.13a+version.1.4.14
⁵ https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Direction  Beam size  BLEU  Off-Target Rate

De→Fr      5          17.3  23.1%
De→Fr      10         16.1  31.7%
De→Fr      20         14.3  41.4%
Cs→De      5          15.4  12.3%
Cs→De      10         15.0  17.4%
Cs→De      20         14.2  22.0%

Table 6.1: Multilingual beam search curse on WMT De→Fr and Cs→De.

6.3 Analyzing Off-Target Occurrence During Beam Search


To understand the off-target occurrence during beam search, we analyze the off-target error types
on different language pairs and conduct case studies with varying beam sizes.

6.3.1 Multilingual Beam Search Curse


The beam search curse [3] is widely observed in bilingual NMT models: given a larger beam
size, beam search explores a larger search space and chooses from a larger candidate pool, yet
empirically, translation performance usually drops significantly with increasing beam sizes. In our
study, we find this phenomenon is also prevalent in the multilingual system and is closely related
to the off-target translation error.
As an example, we demonstrate the beam search curse on WMT De→Fr and Cs→De trans-
lation, since both directions are between high-resource languages and have decent translation
performance (above 15 BLEU).
Table 6.1 shows the results on WMT De→Fr and Cs→De. We can clearly observe that
the off-target rate grows sub-linearly with the beam size, and as a result the BLEU score drops
significantly with increasing beam sizes. This raises the questions of why the off-target rate
increases drastically with larger beam sizes, and whether the performance drop (i.e. the BLEU
decrease) is mainly due to the off-target errors.

6.3.2 Off-Target Error Analysis


As a detailed analysis, we study the off-target error types across six zero-shot pairs (i.e. 12
translation directions) from the WMT dataset. We categorize the off-target errors into three types:

                 b = 5                                b = 20
Directions   Total  →English  →Source  Others    Total  →English  →Source  Others
De→Fr 23.1% 11.8% 11.1% 0.2% 41.4% 18.5% 22.8% 0.1%
Fr→De 39.9% 10.5% 29.4% 0.0% 62.7% 17.2% 45.5% 0.0%
Cs→De 12.3% 8.5% 3.6% 0.2% 22.0% 17.3% 4.5% 0.2%
De→Cs 19.0% 2.5% 15.8% 0.7% 27.6% 5.9% 21.3% 0.4%
De→Ro 1.6% 0.8% 0.5% 0.3% 1.9% 1.1% 0.5% 0.3%
Ro→De 7.3% 5.9% 0.7% 0.7% 16.3% 14.8% 0.7% 0.8%
Fr→Et 22.5% 8.1% 12.4% 2.0% 30.6% 13.6% 15.6% 1.4%
Et→Fr 26.1% 16.5% 6.3% 3.3% 36.6% 26.2% 6.7% 3.7%
Ro→Et 10.8% 6.3% 1.5% 3.0% 14.8% 10.4% 1.6% 2.8%
Et→Ro 2.0% 0.5% 0.2% 1.3% 1.9% 0.6% 0.3% 1.0%
Tr→Gu 73.7% 73.3% 0.2% 0.2% 78.7% 78.1% 0.2% 0.4%
Gu→Tr 36.4% 35.6% 0.1% 0.7% 41.4% 39.6% 0.0% 1.8%
Average 22.9% 15.0% 6.8% 1.1% 31.3% 20.3% 10.0% 1.1%

Table 6.2: Off-Target error analysis on 12 WMT zero-shot directions.

Source (Fr): Sa décision a laissé tout le monde sans voix.
System Output (→De): Sa décision a laissé tout le monde sans voix.

Source (Fr): Abandonnez Chequers et commencez à écouter. »
System Output (→De): Abandonnez Chequers et commencez à écouter »

Source (Fr): « C’est une très bonne chose », dit Jaynes.
System Output (→De): C’est une très bonne chose, sagt Jaynes.

Table 6.3: Case studies for “→Source” errors. We sample three source-translation pairs from the
WMT Fr→De test set (with translations LiD-ed as French). Token differences are colored in red.

translating into English⁶, translating into the source language, and others.

The detailed off-target error analysis of the WMT zero-shot directions is shown in Table 6.2.
We find that even though off-target errors are overwhelming across languages, they fall almost
entirely into two types: translating into English and “translating” into the source language. The
“Others” error type comprises only a negligible 1.1% of cases, given that the FastText LiD model
itself has an error margin of 0.81% [69].

“→Source” errors We hypothesize that this error is related to the previously studied “source
copying” behavior [17] in bilingual NMT models. We sample three cases of this error type
(shown in Table 6.3). The case study confirms that the “→Source” error type is the
⁶ English is never the correct target language in our 12 studied translation directions.

[Figure: histogram of Counts vs. Sentence BLEU]
Figure 6.1: The sentence BLEU distribution between source and system translation from WMT
Fr→De “→Source” errors.

same as the source-copying behavior of bilingual models for these cases. To quantify the degree
of source copying, we run sentence BLEU evaluation⁷ between the source and the system translation
on the WMT Fr→De “→Source” errors. The sentence BLEU distribution is shown in Figure 6.1,
with an average sentence BLEU of 85.3. It clearly demonstrates that the “→Source” errors strongly
display a source-copying behavior that is somehow promoted by larger beam sizes.
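As a rough sketch of this measurement (not the exact evaluation script), the snippet below scores each system translation against its own source sentence with SacreBLEU's sentence-level BLEU and floor smoothing, as in footnote 7; the function and variable names are illustrative assumptions.

```python
# Hypothetical sketch: quantify source copying by computing sentence BLEU
# between each hypothesis and its own source sentence (cf. footnote 7).
from sacrebleu import sentence_bleu

def copy_scores(sources, hypotheses):
    """Sentence BLEU of each hypothesis, treating its source sentence as the reference."""
    scores = []
    for src, hyp in zip(sources, hypotheses):
        # A high score means the output is a near-verbatim copy of the source.
        bleu = sentence_bleu(hyp, [src], smooth_method="floor")
        scores.append(bleu.score)
    return scores

# e.g. an average score around 85 indicates near-copies of the source.
```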

“→English” errors Since none of our evaluated directions includes English as the target lan-
guage, translating into English is never correct and always triggers an off-target error. We
similarly sample three “→English” error cases from the WMT Fr→De test set and compare them
against the actual Fr→En translations produced by the same model from the same French input.
This case study (Table 6.4) suggests that the “→English” generations from WMT Fr→De are
generally similar to, but slightly worse than, the corresponding Fr→En “English” translations. To
demonstrate the similarity, we plot the sentence BLEU distribution for all 172 “→English” errors
between the Fr→De and Fr→En translations in Figure 6.2. It shows a strong similarity between Fr→De and
⁷ We use the sentence_bleu function from [67] with smooth_method=‘floor’: https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/compat.py

Source (Fr): Comme la campagne était très avancée, elle avait pris du retard dans la collecte de fonds, et a donc juré qu’elle ne participerait pas à moins de recueillir 2 millions de dollars.
System Output (→De): Since the campaign was very advanced, it had fallen behind in the collection of funds, and therefore swore that it would not participate to less than raise 2 million dollars.
System Output (→En): Since the campaign was very advanced, it had lagged behind in raising funds and therefore swore that it would not participate unless it raised $2 million.

Source (Fr): Woods a perdu ses quatre matchs en France et détient maintenant un record de 13-21-3 en carrière en Ryder Cup.
System Output (→De): Woods has lost his four matches in France and now holds a record of 13-21-3 in career in Ryder Cup.
System Output (→En): Woods lost his four matches in France and now holds a record of 13-21-3 in the Ryder Cup.

Source (Fr): Le couple réfute être raciste, et assimile les poursuites à une « extorsion ».
System Output (→De): The couple refutes being racist, and assimilates prosecutions to a “repression”.
System Output (→En): The couple refutes being racist and treats prosecutions as “extortion”.

Table 6.4: Case studies for “→English” errors. We sample three source-translation pairs from the
WMT Fr→De test set (with translations LiD-ed as English). As a comparison, we also show the
output when the model is asked to translate into English. Token differences are colored in red.

Direction BLEU chrF2 TER*


Fr→De 26.92 0.572 0.62
Fr→En 34.91 0.611 0.53

Table 6.5: English translation quality for all WMT Fr→De “→English” errors. Both translation
directions are evaluated against English human references. *For TER, lower is better.

Fr→En translations, with an average sentence BLEU of 55.9. Since the evaluation data of the
WMT corpus is multi-way aligned, we can evaluate the →English translation quality for both
Fr→De and Fr→En against the English human references (in Table 6.5). Results confirm our
observation that the “→English” errors are generally poorer English translations.

6.3.3 Beam Search Process Analysis


To understand how “→English” and “→Source” errors emerge during beam search and why both
errors dramatically increase with larger beam sizes, we investigate the step-by-step beam search
process with case studies. Tables 6.6 and 6.7 illustrate one representative decoding example

[Figure: histogram of Counts vs. Sentence BLEU]
Figure 6.2: The sentence BLEU distribution between WMT Fr→De “→English” translation and
Fr→En translation with the same source input.

from the WMT Fr→De test set with b = 5 and b = 20 and the French source “Nous avons maintenant une
excellente relation. »”. For b = 20, we only print the top-5 beams due to space limits. From
this example, we have a few observations:

• English candidates are live in the early steps (1-3) for b = 5 but tend to be dropped at
later time steps. Meanwhile, for b = 20, both English and French candidates are kept alive
throughout the decoding process: even though they fall out of the top-5 beams at the 4th
step, the off-target candidates quickly catch up and are ranked highest by the 7th step.

• Closely observing the winning English candidate for b = 20, we notice that it suffers a heavy
penalty at the first step (log probability -3.58), yet all following steps incur only small penalties.

• The final English translation from b = 20 is in fact a “better” candidate with a higher proba-
bility (i.e. model score) than the final German translation from b = 5; therefore, if
this off-target candidate is retained throughout the process, it naturally wins out against
all valid on-target translations.

From the above observations, we can try to answer our previous research questions.

RQ1. How do “→English” and “→Source” errors emerge during decoding?

We first observe that both “→English” and “→Source” candidates are easily reachable in
the early steps of decoding. The model places a low probability on decoding the first
source or English token, but the relatively high transition probabilities for the remaining source or
English tokens can result in off-target sentences eventually scoring higher than on-target ones.

RQ2. Why do both errors dramatically increase with larger beam sizes?

With a larger beam budget, off-target candidates are more likely to be retained through the
early steps, where they receive heavy penalties. Since off-target candidates then incur smaller
penalties at the later steps, they tend to win out over on-target candidates in the long
run. Note that we find it to be generally the case that off-target continuations receive higher
probabilities (smaller penalties) than on-target ones, even though the first off-target token is
heavily penalized by the model. We hypothesize that this is due to recency bias and poor calibration,
but it remains an interesting research question for future work.
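As a toy illustration of this dynamic, the snippet below accumulates hypothetical per-step log probabilities, loosely modeled on the candidates in Tables 6.6 and 6.7 (they are not exact model outputs): the off-target hypothesis pays a heavy first-step penalty but smaller penalties afterwards, so with a large enough beam it survives pruning and eventually overtakes the on-target hypothesis.

```python
# Hypothetical per-step log probabilities, loosely modeled on Tables 6.6/6.7.
on_target  = [-1.10, -0.34, -0.78, -0.43, -1.01, -0.16, -1.91]  # e.g. the German hypothesis
off_target = [-3.58, -0.52, -0.10, -0.35, -0.25, -0.09, -0.81]  # e.g. the English hypothesis

on_sum, off_sum = 0.0, 0.0
for step, (a, b) in enumerate(zip(on_target, off_target), start=1):
    on_sum += a
    off_sum += b
    print(f"step {step}: on-target {on_sum:6.2f}   off-target {off_sum:6.2f}")
# With a small beam, the off-target prefix (-3.58 after step 1) is pruned immediately;
# with a large beam it survives and finishes with the higher cumulative score
# (-5.70 vs. -5.73), so standard beam search returns the off-target translation.
```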

Possible Solutions The fact that off-target candidates attain higher model scores shows that the
model is poorly calibrated, especially at the later steps of autoregressive decoding. Methods
using additional training data [58, 59] or regularization [69] could alleviate this issue during
training by producing a better-calibrated model. In this work, armed with the knowledge of how
off-target cases emerge during decoding, we attempt to fix the issue solely at decoding time by
improving the beam search algorithm, even with a demonstrably poorly calibrated model.

6.4 Language-Informed Beam Search (LIBS)


The standard beam search process originating from the bilingual NMT model is target-language-
agnostic and is found to produce an overwhelming amount of off-target translations [59, 69]. Yet
the target language (i.e. the desired language for generation) is always known during decoding,
thus it is straightforward to enforce the desired language to reduce the off-target rates without any
additional training or data. We thus propose Language-Informed Beam Search (LIBS), a general
decoding algorithm to inform the beam search process of the desired language during decoding.

Algorithm 2: Language-Informed Beam Search

Input: MNMT model θ, LiD model γ, source sentence x, target language T, beam size b, pre-select window size w
Output: Finished candidate set C (initialized to ∅)
  ▷ Initialize each beam i with the BOS symbol and a unit score
1 B_i ← {⟨1.0, <s>⟩}
2 repeat
  ▷ Pre-select the top-w continuations from each beam i
3   W_i ← top_w {⟨s · p_θ(y′ | x, y), y ◦ y′⟩ | ⟨s, y⟩ ∈ B_i, y′ ∈ V}
  ▷ Sort all candidates by the linearly combined NMT and LiD log probabilities
4   W ← Sort{⟨log s + α log p_γ(T | y), s, y⟩ | ⟨s, y⟩ ∈ ∪_{i=1}^{b} W_i}
  ▷ Store all finished candidates among the top-b into C
5   C ← C ∪ {⟨s, y⟩ | ⟨s′, s, y⟩ ∈ top_b{W}, y_{|y|} = </s>}
  ▷ Store the top-b unfinished candidates into B
6   B ← top_b {⟨s, y⟩ | ⟨s′, s, y⟩ ∈ W, y_{|y|} ≠ </s>}
7 until C has b finished candidates (i.e. |C| = b)
  ▷ Rerank finished candidates by the linearly combined NMT and LiD log probabilities
8 C ← Sort{⟨log s + α log p_γ(T | y), y⟩ | ⟨s, y⟩ ∈ C}
9 return C

To inform the beam search process of the desired language, we borrow an off-the-shelf
Language Identification (LiD) model to score the running beam search candidates with their
probabilities of being in the correct language. Since the candidates are normally ranked by the NMT
model probabilities, we linearly combine the two log probabilities to, ideally, find the best candidate
in the correct language.
The detailed algorithm is given in Algorithm 2. At each step, we first pre-select the top-w
continuations from each beam. We then sort all b · w active candidates by the linearly combined
NMT and LiD log probabilities, where the linear coefficient α is tuned on the dev set. Following
the Fairseq [37] implementation, we only store the finished candidates that fall within the top-b,
and save the top-b unfinished candidates into the beam for the next step⁸. Decoding stops
once we have found b finished candidates, and at the end of generation we again rerank
all finished candidates with the linearly combined log probabilities.
⁸ We only store the NMT model score instead of the linearly combined one to avoid overcounting LiD scores.
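To make the reranking step concrete, here is a minimal Python sketch of the candidate re-scoring in line 4 of Algorithm 2, assuming a hypothetical candidate representation of (cumulative NMT log probability, list of subword tokens); this is not the Fairseq implementation, and the un-BPE logic and helper names are illustrative.

```python
# Hypothetical sketch of the LIBS re-scoring step (line 4 of Algorithm 2).
import math
import fasttext

lid = fasttext.load_model("lid.176.bin")  # off-the-shelf FastText LiD model

def lid_log_prob(prefix_tokens, target_lang):
    """Log probability (under the LiD model) that the de-BPE'd prefix is in target_lang."""
    text = "".join(prefix_tokens).replace("\u2581", " ").strip()  # undo SentencePiece segmentation
    labels, probs = lid.predict(text, k=176)
    lang_probs = {l.replace("__label__", ""): p for l, p in zip(labels, probs)}
    return math.log(max(lang_probs.get(target_lang, 0.0), 1e-9))

def rerank(candidates, target_lang, alpha):
    """Sort candidates by log p_theta + alpha * log p_gamma(T | y), best first.

    candidates: list of (nmt_log_prob, prefix_tokens) pairs.
    """
    return sorted(
        candidates,
        key=lambda c: c[0] + alpha * lid_log_prob(c[1], target_lang),
        reverse=True,
    )
```

Note that only the NMT log probability is carried forward to the next step; the LiD term is recomputed on each prefix, mirroring footnote 8's point about not overcounting LiD scores.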

Design Choice and Speed Concern We only pre-select b · w candidates for the LiD scoring,
instead of considering all possible continuations (as in Equation 2.5), simply because we cannot
afford to run the LiD model on all b · |V| candidates.
Even though in our experiments we only pre-select the top-2 continuations from each beam
(i.e. w = 2), the main slowdown of the LIBS algorithm is still the un-BPE operation and LiD
scoring on line 4 of Algorithm 2.
To speed up the LIBS algorithm, we use the FastText LiD model since it is both fast and
accurate (in our case, on generation prefixes). With its help, LIBS is only 7.5 times slower than
Fairseq beam search decoding on a single CPU, and 3.5 times slower with LiD scoring parallelized
over 20 CPUs.

6.5 Experiment Results


To fully verify the performance of the LIBS algorithm, we compare LIBS against the baseline
beam search decoding on both WMT and OPUS-100 datasets.

6.5.1 WMT Results


Tuning the Linear Coefficient α We tune the linear coefficient α on the dev set. As shown
in Table 6.8, any α value from 0.8 to 1.2 performs similarly well. Because the linear coefficient
α controls the weight of the LiD model score, as α increases, the off-target rate monotonically
drops. We use α = 0.9 for all the experiments on the WMT dataset.

Multilingual Beam Search Curse As illustrated before, the beam search curse exists in Multi-
lingual NMT models predominantly due to the increasing off-target errors with larger beam sizes.
As shown in Table 6.9, LIBS successfully breaks the beam search curse by preventing off-target
translations.

Zero-Shot Performance Table 6.10 shows the full results of LIBS on the WMT dataset.
On average across all zero-shot directions, LIBS improves BLEU by +1.0 while reducing the
off-target rate from 22.91% to 7.71%. We notice that for many directions the off-target rate is
now close to the error margin of the FastText LiD model, which is 0.81% as reported in [69].
This suggests that those translation directions no longer suffer from off-target errors, and that the

[Figure: BLEU score and off-target rate vs. the linear coefficient α]
Figure 6.3: Translation performance (BLEU and off-target rates) with different α values on the
OPUS-100 Fr→De test set.

reported errors are largely due to the LiD model's own error. Meanwhile, the MNMT model still suffers
from a large amount of off-target errors, especially on Gu→Tr and Tr→Gu translations, which
we hypothesize is due to the extremely low resources for both languages (WMT contains 180K
and 80K parallel sentences for Tr and Gu, respectively).

6.5.2 OPUS-100 Results


To further verify the effectiveness of our LIBS algorithm, we compare it against the baseline beam
search decoding on the large-scale OPUS-100 dataset, which covers a total of 100 languages.
Different from the WMT experiment, we tune and set α = 1.8 for all directions. This is due
to the challenging nature of the OPUS-100 dataset: the baseline performs very poorly on the zero-shot
directions, with a massive amount of off-target translations, so a higher α value for LIBS can
effectively reduce the off-target rates and improve translation performance. In Figure 6.3, we
plot the performance curve on the OPUS-100 Fr→De test set with increasing α values. It clearly
shows that a larger α value consistently decreases the off-target rates and improves the overall

performance (i.e. a higher BLEU score)⁹.


The zero-shot translation performance of LIBS on the OPUS-100 dataset is shown in Table 6.11.
Across all directions, LIBS improves BLEU by an average of +0.9 and reduces the
off-target rate from 65.79% to 25.34%.
Both the WMT and OPUS-100 results clearly show that our LIBS algorithm notably improves
zero-shot translation performance by significantly reducing off-target translations.

⁹ The flat BLEU curve is due to the one-decimal-digit precision of the sacreBLEU evaluation.

b=5 b = 20
Step Beam LiD LogProb Beam LiD LogProb
_Wir De -1.10 _Wir De -1.10
_" En -2.93 _" En -2.93
1 _Jetzt De -3.02 _Jetzt De -3.02
_We En -3.58 _We En -3.58
_„ Ro -3.61 _„ Ro -3.61
_Wir _haben De -1.44 _Wir _haben De -1.44
_Jetzt _haben De -3.48 _Jetzt _haben De -3.48
2 _We _now En -4.10 _We _now En -4.10
_Wir _verfügen De -4.29 _Nous _avons Fr -4.13
_We _have En -4.83 _Wir _verfügen De -4.29
_Wir _haben _jetzt De -2.22 _Wir _haben _jetzt De -2.22
_Wir _haben _nun De -2.82 _Wir _haben _nun De -2.82
3 _Jetzt _haben _wir De -3.60 _Jetzt _haben _wir De -3.60
_We _now _have En -4.20 _We _now _have En -4.20
_Wir _haben _eine De -4.83 _Nous _avons _maintenant Fr -4.36
_Wir _haben _jetzt _eine De -2.65 _Wir _haben _jetzt _eine De -2.65
_Wir _haben _nun _eine De -3.24 _Wir _haben _nun _eine De -3.24
4 _Wir _haben _jetzt _ein De -3.93 _Wir _haben _jetzt _ein De -3.93
_Jetzt _haben _wir _eine De -4.12 _Jetzt _haben _wir _eine De -4.12
_Wir _haben _nun _ein De -4.58 _Wir _haben _nun _ein De -4.58
_Wir _haben _jetzt _eine De -3.66 _Wir _haben _jetzt _eine De -3.66
_ausgezeichnete _ausgezeichnete
5 _Wir _haben _nun _eine De -4.23 _Wir _haben _nun _eine De -4.23
_ausgezeichnete _ausgezeichnete
_Wir _haben _jetzt _eine De -4.25 _Wir _haben _jetzt _eine De -4.25
_hervorragende _hervorragende
_Wir _haben _nun _eine De -4.81 _We _now _have _an En -4.80
_hervorragende _excellent
_Wir _haben _jetzt _eine De -4.95 _Wir _haben _nun _eine De -4.81
_exzellente _hervorragende

Table 6.6: Beam Search case study for b = 5 and b = 20 on one example from WMT Fr→De
test set. English candidates (“→English” errors) are colored in red, while French candidates
(“→Source” errors) are colored in blue.

b=5 b = 20
Step Beam LiD LogProb Beam LiD LogProb
_Wir _haben _jetzt _eine De -3.82 _Wir _haben _jetzt _eine De -3.82
_ausgezeichnete _Beziehung _ausgezeichnete _Beziehung
6 _Wir _haben _nun _eine De -4.39 _Wir _haben _nun _eine De -4.39
_ausgezeichnete _Beziehung _ausgezeichnete _Beziehung
_Wir _haben _jetzt _eine De -4.42 _Wir _haben _jetzt _eine De -4.42
_hervorragende _Beziehung _hervorragende _Beziehung
_Wir _haben _nun _eine De -4.98 _We _now _have _an En -4.89
_hervorragende _Beziehung _excellent _relationship
_Wir _haben _jetzt _eine De -5.12 _Wir _haben _nun _eine De -4.98
_exzellente _Beziehung _hervorragende _Beziehung
_Wir _haben _jetzt _eine De -5.73 _Nous _avons _maintenant Fr -5.53
_ausgezeichnete _une _excellent e _relation
7 _Beziehung .“
_Wir _haben _jetzt _eine De -5.89 _Wir _haben _jetzt _ein De -5.54
_ausgezeichnete _Beziehung _ausgezeichnete s
. _Verhältnis
_Wir _haben _jetzt _eine De -5.95 _We _now _have _an En -5.70
_ausgezeichnete _Beziehung _excellent _relationship ."
."
_Wir _haben _jetzt _eine De -6.28 _Wir _haben _jetzt _eine De -5.73
_hervorragende _Beziehung _ausgezeichnete _Beziehung
.“ .“
_Wir _haben _nun _eine De -6.31 _Wir _haben _jetzt _ein De -5.82
_ausgezeichnete _Beziehung _hervorragende s _Verhältnis
.“

Table 6.7: Beam Search case study for b = 5 and b = 20 on one example from WMT Fr→De
test set. English candidates (“→English” errors) are colored in red, while French candidates
(“→Source” errors) are colored in blue. Final translations (at step 7) are in bold, where b = 5
generates a German translation, and b = 20 generates an off-target English translation at the 7th
step.

De→Fr Cs→De
Model
BLEU Off-Tgt BLEU Off-Tgt
Baseline 17.3 23.1% 15.4 12.3%
+LIBS, α = 0.7 20.6 2.0% 16.1 1.6%
+LIBS, α = 0.8 20.7 1.4% 16.2 1.6%
+LIBS, α = 0.9 20.7 1.1% 16.2 1.6%
+LIBS, α = 1.0 20.7 0.9% 16.2 1.4%
+LIBS, α = 1.1 20.7 0.9% 16.2 1.4%
+LIBS, α = 1.2 20.7 0.8% 16.2 1.3%

Table 6.8: Tuning the linear coefficient α on WMT De→Fr and Cs→De.

De→Fr Cs→De
Model
b=5 b = 10 b = 20 b=5 b = 10 b = 20
BLEU 17.3 16.1 14.3 15.4 15.0 14.2
Baseline
Off-Tgt 23.1% 31.7% 41.4% 12.3% 17.4% 22.0%
BLEU 20.7 20.9 20.7 16.2 16.4 16.2
+LIBS
Off-Tgt 1.1% 1.2% 1.1% 1.6% 1.7% 1.5%

Table 6.9: LIBS breaks the beam search curse on WMT De→Fr and Cs→De.

Fr-De De-Cs Ro-De Et-Fr Et-Ro Gu-Tr


Zero-Shot Average
← → ← → ← → ← → ← → ← →
BLEU 17.3 11.7 15.4 13.9 17.2 16.1 10.6 13.5 11.9 14.1 0.9 2.0 12.05
Baseline
Off-Tgt 23.1% 39.9% 12.3% 19.0% 1.6% 7.3% 22.5% 26.1% 10.8% 2.0% 73.7% 36.6% 22.91%
BLEU 20.7 15.7 16.2 15.3 17.1 16.5 11.8 14.6 12.4 13.9 1.2 2.3 13.14
+LIBS
Off-Tgt 1.1% 6.5% 1.6% 3.8% 0.6% 0.6% 8.3% 4.6% 2.9% 0.3% 47.2% 15.0% 7.71%

Table 6.10: BLEU score and Off-Target rate of zero-shot translations on WMT dataset.

De-Fr Ru-Fr Nl-De Zh-Ru Zh-Ar Nl-Ar


Zero-Shot Average
← → ← → ← → ← → ← → ← →
BLEU 3.3 3.0 5.4 4.0 5.9 5.2 5.7 11.8 3 11.6 1.2 3.2 5.28
Baseline
Off-Tgt 95.2% 93.7% 68.9% 91.2% 88.4% 89.7% 37.0% 20.2% 89.9% 8.0% 93.0% 14.3% 65.79%
BLEU 5.0 3.8 9.6 4.4 7.4 5.9 5.9 12.2 3.5 11.1 2.5 2.8 6.18
+LIBS
Off-Tgt 46.9% 49.5% 22.5% 41.6% 37.9% 40.0% 5.8% 1.1% 28.3% 0.7% 28.4% 1.4% 25.34%

Table 6.11: BLEU score and Off-Target rate of zero-shot translations on OPUS-100 dataset.

Chapter 7: Summary & Future Work

In this dissertation, we explain NMT model behaviors and failures from different aspects, and
propose original and effective solutions to improve the translation performance.
In Chapter 3, we explain the beam search curse through the length bias (i.e. the NMT model
tends to generate shorter candidates with larger beam sizes), and propose new rescoring methods
to effectively break the curse.
In Chapter 4, we look closely into the sub-layer functionalities of the NMT decoder. Through
our proposed information probing framework, we illustrate how each individual module utilizes
source and target information and contributes to the translation process. Moreover, we find the
heaviest decoder layers (taking half of the parameters) could be pruned without performance loss.
In Chapter 5, we find an overwhelming amount of off-target errors in the state-of-the-art
Multilingual translation models, and propose a joint representation-level and gradient-level regu-
larization to significantly reduce the off-target errors as well as improve translation performance.
In Chapter 6, we go a step further and investigate how off-target errors emerge during beam search
decoding, and propose a novel Language-Informed Beam Search algorithm to significantly reduce
off-target errors solely at decoding time.
Yet NMT models still have a long way to go before they are fully explainable. With the deficien-
cies we have identified and the fixes we have proposed, we are only able to explain part of the picture,
while the Deep Neural Network model remains too complex to fully explain the reasoning behind its behaviors.
Neural Networks have been shown to be superior at fitting any intrinsic function between the input
and output, given enough data samples. In that sense, the machine translation task can be
seen as one-to-one semantic matching between source and target sentences across the language
barrier. Yet due to the multiplicity of human language, translation is far from a one-
to-one matching; it is rather an exponential many-to-many matching, since one sentence in any
language can be rephrased into an exponential number of semantically equivalent sentences.
This multiplicity raises a serious issue for current NMT training: given enough training samples,
one input sentence will appear with semantically equivalent yet superficially different training targets.
Different training targets paired with the same input inevitably confuse NMT training, as one model
cannot learn to generate two different targets. Therefore

the multiplicity property is never learned by standard NMT training. On the other
hand, current NMT inference also has a serious mode-scarcity issue, where the generations
are similar to each other despite the rich diversity of human language.
As a future direction, we hope to build a better NMT model that disentangles Neural Network
learning from the translation objective, where the neural network serves only as a function
approximator and the translation objective is separated from the network training. Back in
the Statistical Machine Translation era, the translation system was built by combining two
models: a translation model and a (target-side) language model. We believe that implementing
these two models as learned Neural Networks, without changing the overall translation framework,
would dramatically improve explainability without sacrificing the performance gained from massive
training data. It could also help capture the multiplicity property, since NMT inference
would become picking semantically equivalent words or phrases and concatenating them with
the DNN-based (target) language model, instead of monotonic left-to-right decoding.

Bibliography

[1] P Koehn, FJ Och, and D Marcu, “Statistical phrase-based translation”, in Proceedings of


the 2003 conference of the north american chapter of the association for computational
linguistics on human language technology-volume 1 (Association for Computational Lin-
guistics, 2003), pp. 48–54.
[2] D Bahdanau, K Cho, and Y Bengio, “Neural machine translation by jointly learning to
align and translate”, arXiv preprint arXiv:1409.0473 (2014).
[3] Y Yang, L Huang, and M Ma, “Breaking the beam search curse: a study of (re-) scoring
methods and stopping criteria for neural machine translation”, arXiv preprint arXiv:1808.09582
(2018).
[4] Y Yang, L Wang, S Shi, P Tadepalli, S Lee, and Z Tu, “On the sub-layer functionalities of
transformer decoder”, arXiv preprint arXiv:2010.02648 (2020).
[5] M Johnson, M Schuster, QV Le, M Krikun, Y Wu, Z Chen, N Thorat, F Viégas, M Watten-
berg, G Corrado, et al., “Google’s multilingual neural machine translation system: enabling
zero-shot translation”, Transactions of the Association for Computational Linguistics 5,
339–351 (2017).
[6] J Gehring, M Auli, D Grangier, D Yarats, and YN Dauphin, “Convolutional sequence to
sequence learning”, in Proc. of icml (2017).
[7] A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, AN Gomez, L Kaiser, and I
Polosukhin, “Attention is all you need”, arXiv preprint arXiv:1706.03762 (2017).
[8] N Arivazhagan, A Bapna, O Firat, D Lepikhin, M Johnson, M Krikun, MX Chen, Y Cao,
G Foster, C Cherry, et al., “Massively multilingual neural machine translation in the wild:
findings and challenges”, arXiv preprint arXiv:1907.05019 (2019).
[9] I Sutskever, O Vinyals, and QV Le, “Sequence to sequence learning with neural networks”,
Advances in neural information processing systems 27 (2014).
[10] M Ranzato, S Chopra, M Auli, and W Zaremba, “Sequence level training with recurrent
neural networks”, arXiv preprint arXiv:1511.06732 (2015).

[11] S Shen, Y Cheng, Z He, W He, H Wu, M Sun, and Y Liu, “Minimum risk training for
neural machine translation”, arXiv preprint arXiv:1512.02433 (2015).
[12] S Wiseman and AM Rush, “Sequence-to-sequence learning as beam-search optimization”,
arXiv preprint arXiv:1606.02960 (2016).
[13] P Koehn, H Hoang, A Birch, C Callison-Burch, M Federico, N Bertoldi, B Cowan, W Shen,
C Moran, R Zens, et al., “Moses: open source toolkit for statistical machine translation”,
in Proceedings of the 45th annual meeting of the association for computational linguistics
companion volume proceedings of the demo and poster sessions (2007), pp. 177–180.
[14] Y Wu, M Schuster, Z Chen, QV Le, M Norouzi, W Macherey, M Krikun, Y Cao, Q Gao, K
Macherey, et al., “Google’s neural machine translation system: bridging the gap between
human and machine translation”, arXiv preprint arXiv:1609.08144 (2016).
[15] P Koehn and R Knowles, “Six challenges for neural machine translation”, arXiv preprint
arXiv:1706.03872 (2017).
[16] L Huang, K Zhao, and M Ma, “When to finish? optimal beam search for neural text
generation (modulo beam size)”, arXiv preprint arXiv:1809.00069 (2018).
[17] M Ott, M Auli, D Grangier, and M Ranzato, “Analyzing uncertainty in neural machine
translation”, in International conference on machine learning (PMLR, 2018), pp. 3956–
3965.
[18] W He, Z He, H Wu, and H Wang, “Improved neural machine translation with smt features”,
in Thirtieth aaai conference on artificial intelligence (2016).
[19] K Papineni, S Roukos, T Ward, and WJ Zhu, “Bleu: a method for automatic evaluation
of machine translation”, in Proceedings of the 40th annual meeting on association for
computational linguistics (Association for Computational Linguistics, 2002), pp. 311–318.
[20] X Shi, K Knight, and D Yuret, “Why neural translations are the right length”, in Proceed-
ings of the 2016 conference on empirical methods in natural language processing (2016),
pp. 2278–2282.
[21] K Murray and D Chiang, “Correcting length bias in neural machine translation”, arXiv
preprint arXiv:1808.10006 (2018).
[22] G Klein, Y Kim, Y Deng, J Senellart, and AM Rush, “Opennmt: open-source toolkit for
neural machine translation”, arXiv preprint arXiv:1701.02810 (2017).

[23] R Sennrich, B Haddow, and A Birch, “Neural machine translation of rare words with
subword units”, arXiv preprint arXiv:1508.07909 (2015).
[24] E Strubell, P Verga, D Andor, D Weiss, and A McCallum, “Linguistically-informed self-
attention for semantic role labeling”, in Emnlp (2018).
[25] J Devlin, MW Chang, K Lee, and K Toutanova, “Bert: pre-training of deep bidirectional
transformers for language understanding”, in Naacl (2019).
[26] A Raganato, J Tiedemann, et al., “An analysis of encoder representations in transformer-
based machine translation”, in Blackboxnlp workshop (2018).
[27] B Yang, L Wang, DF Wong, LS Chao, and Z Tu, “Assessing the ability of self-attention
networks to learn word order”, arXiv preprint arXiv:1906.00592 (2019).
[28] G Tang, R Sennrich, and J Nivre, “Encoders help you disambiguate word senses in neural
machine translation”, arXiv preprint arXiv:1908.11771 (2019).
[29] J Li, Z Tu, B Yang, MR Lyu, and T Zhang, “Multi-head attention with disagreement
regularization”, EMNLP (2018).
[30] E Voita, D Talbot, F Moiseev, R Sennrich, and I Titov, “Analyzing multi-head self-
attention: specialized heads do the heavy lifting, the rest can be pruned”, in Acl (2019).
[31] P Michel, O Levy, and G Neubig, “Are sixteen heads really better than one?”, Advances
in neural information processing systems 32 (2019).
[32] G Tang, R Sennrich, and J Nivre, “Understanding neural machine translation by simplifi-
cation: the case of encoder-free models”, arXiv preprint arXiv:1907.08158 (2019).
[33] A Radford, J Wu, R Child, D Luan, D Amodei, and I Sutskever, “Language models are
unsupervised multitask learners”, arXiv (2019).
[34] K He, X Zhang, S Ren, and J Sun, “Deep residual learning for image recognition”, in Cvpr
(2016).
[35] JL Ba, JR Kiros, and GE Hinton, “Layer normalization”, arXiv (2016).
[36] Z Tu, Z Lu, Y Liu, X Liu, and H Li, “Modeling coverage for neural machine translation”,
arXiv preprint arXiv:1601.04811 (2016).
[37] M Ott, S Edunov, A Baevski, A Fan, S Gross, N Ng, D Grangier, and M Auli, “Fairseq: a
fast, extensible toolkit for sequence modeling”, in Naacl-hlt (2019).

[38] DP Kingma and J Ba, “Adam: a method for stochastic optimization”, in Iclr (2015).
[39] Y Belinkov, N Durrani, F Dalvi, H Sajjad, and J Glass, “What do neural machine transla-
tion models learn about morphology?”, in Acl (2017).
[40] I Tenney, P Xia, B Chen, A Wang, A Poliak, RT McCoy, N Kim, B Van Durme, SR
Bowman, D Das, et al., “What do you learn from context? probing for sentence structure
in contextualized word representations”, in Iclr (2019).
[41] P Vincent, H Larochelle, I Lajoie, Y Bengio, PA Manzagol, and L Bottou, “Stacked
denoising autoencoders: learning useful representations in a deep network with a local
denoising criterion.”, Journal of machine learning research 11 (2010).
[42] O Vinyals, Ł Kaiser, T Koo, S Petrov, I Sutskever, and G Hinton, “Grammar as a foreign
language”, in Nips (2015).
[43] FJ Och and H Ney, “A systematic comparison of various statistical alignment models”,
Computational Linguistics (2003).
[44] Y Liu and M Sun, “Contrastive unsupervised word alignment with non-local features”, in
Aaai (2015).
[45] B Zhang, I Titov, and R Sennrich, “Improving deep transformer with depth-scaled initial-
ization and merged attention”, arXiv preprint arXiv:1908.11365 (2019).
[46] A Conneau, G Kruszewski, G Lample, L Barrault, and M Baroni, “What you can cram
into a single $ &!#* vector: probing sentence embeddings for linguistic properties”, in
Acl (2018).
[47] X Shi, I Padhi, and K Knight, “Does string-based neural MT learn source syntax?”, in
Emnlp (2016).
[48] A Bisazza and C Tump, “The lazy encoder: a fine-grained analysis of the role of morphol-
ogy in neural machine translation”, in Emnlp (2018).
[49] T Blevins, O Levy, and L Zettlemoyer, “Deep rnns encode soft hierarchical syntax”, in
Acl (2018).
[50] I Tenney, D Das, and E Pavlick, “BERT rediscovers the classical nlp pipeline”, in Acl
(2019).
[51] S Jain and BC Wallace, “Attention is not explanation”, arXiv preprint arXiv:1902.10186
(2019).

[52] Y Belinkov, L Màrquez, H Sajjad, N Durrani, F Dalvi, and J Glass, “Evaluating layers of
representation in neural machine translation on part-of-speech and semantic tagging tasks”,
arXiv preprint arXiv:1801.07772 (2018).
[53] F Dalvi, N Durrani, H Sajjad, Y Belinkov, and S Vogel, “Understanding and improving
morphological learning in the neural machine translation decoder”, in Ijcnlp (2017).
[54] D Emelin, I Titov, and R Sennrich, “Widening the representation bottleneck in neural
machine translation with lexical shortcuts”, in Proceedings of the fourth conference on
machine translation (volume 1: research papers) (2019).
[55] Z Wang, ZC Lipton, and Y Tsvetkov, “On negative interference in multilingual language
models”, in Emnlp (2020), pp. 4438–4450.
[56] Y Wang, J Zhang, F Zhai, J Xu, and C Zong, “Three strategies to improve one-to-many
multilingual translation”, in Proceedings of the 2018 conference on empirical methods in
natural language processing (2018), pp. 2955–2960.
[57] Y Tang, C Tran, X Li, PJ Chen, N Goyal, V Chaudhary, J Gu, and A Fan, “Multilin-
gual translation with extensible multilingual pretraining and finetuning”, arXiv preprint
arXiv:2008.00401 (2020).
[58] J Gu, Y Wang, K Cho, and VO Li, “Improved zero-shot neural machine translation via
ignoring spurious correlations”, arXiv preprint arXiv:1906.01181 (2019).
[59] B Zhang, P Williams, I Titov, and R Sennrich, “Improving massively multilingual neural
machine translation and zero-shot translation”, arXiv preprint arXiv:2004.11867 (2020).
[60] R Sennrich, B Haddow, and A Birch, “Improving neural machine translation models with
monolingual data”, arXiv preprint arXiv:1511.06709 (2015).
[61] X Wang, Y Tsvetkov, and G Neubig, “Balancing training for multilingual neural machine
translation”, arXiv preprint arXiv:2004.06748 (2020).
[62] T Yu, S Kumar, A Gupta, S Levine, K Hausman, and C Finn, “Gradient surgery for
multi-task learning”, arXiv preprint arXiv:2001.06782 (2020).
[63] Z Wang, Y Tsvetkov, O Firat, and Y Cao, “Gradient vaccine: investigating and improving
multi-task optimization in massively multilingual models”, arXiv preprint arXiv:2010.05874
(2020).

[64] X Wang, A Bapna, M Johnson, and O Firat, “Gradient-guided loss masking for neural
machine translation”, arXiv preprint arXiv:2102.13549 (2021).
[65] Y Wang, C Zhai, and HH Awadalla, “Multi-task learning for multilingual neural machine
translation”, arXiv preprint arXiv:2010.02523 (2020).
[66] T Kudo and J Richardson, “Sentencepiece: a simple and language independent subword
tokenizer and detokenizer for neural text processing”, arXiv preprint arXiv:1808.06226
(2018).
[67] M Post, “A call for clarity in reporting bleu scores”, arXiv preprint arXiv:1804.08771
(2018).
[68] A Joulin, E Grave, P Bojanowski, M Douze, H Jégou, and T Mikolov, “Fasttext.zip:
compressing text classification models”, arXiv preprint arXiv:1612.03651 (2016).
[69] Y Yang, A Eriguchi, A Muzio, P Tadepalli, S Lee, and H Hassan, “Improving multilingual
translation by representation and gradient regularization”, arXiv preprint arXiv:2109.04778
(2021).
