Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao
Google AI
strained settings. Here we approach this challenge from the opposite end of the spectrum, determining what is possible with our current technology and understanding of NMT and ML, and probing for the effect of various design strategies in an extreme, real-world setting.

Our desired features of a truly multilingual translation model can be characterized as:
3.1 Our Data Setup

Our problem significantly extends those studied by these previous works: we study multilingual NMT on a massive scale, using an in-house corpus generated by crawling and extracting parallel sentences from the web (Uszkoreit et al., 2010). This corpus contains parallel documents for 102 languages, to and from English, containing a total of 25 billion sentence pairs.⁴ The number of parallel sentences per language in our corpus ranges from around tens of thousands to almost 2 billion. Figure 1 illustrates the data distribution across language pairs for all 204 language pairs we study. The following specifics of our dataset distinguish our problem from previous work on multilingual NMT:

• Scale: Even on our lowest resource languages we often exceed the amount of data available in a majority of the previously studied datasets. Given the amount of data, techniques developed in relatively low-resource setups may not be as effective.

• Distribution: The availability of quality parallel data follows a sharp power law, and data becomes increasingly scarce as we expand the scope of the system to more languages. There is a discrepancy of almost 5 orders of magnitude between our highest and our lowest resource languages. Balancing between these different language pairs now becomes a very challenging problem.

• Domain and Noise: Having been mined from the web, our dataset spans a vast range of domains. However, such web crawled data is also extremely noisy; this problem gets worse in the multilingual setting, where the level of noise might be different across different languages. While clean sources of parallel data are also available, they are often limited to narrow domains and high resource languages.

To summarize, the training data used in our study is drawn from 102 languages (+ English), exhibits a power-law in terms of number of training examples across language pairs, and spans a rich set of domains with varying noise levels, making our overall attempt as realistic as possible. Please see Table 8 in the Appendix for the full list of languages.

3.2 Experiments and Baselines

Throughout this paper we perform several experiments on the training dataset described above, to highlight challenges associated with different aspects of multilingual models. We first train dedicated bilingual models on all language pairs to ground our multilingual analyses. We perform all our experiments with variants of the Transformer architecture (Vaswani et al., 2017), using the open-source Lingvo framework (Shen et al., 2019). For most bilingual experiments, we use a larger version of Transformer Big (Chen et al., 2018a) containing around 375M parameters, and a shared source-target sentence-piece model (SPM) (Kudo and Richardson, 2018) vocabulary with 32k tokens. We tune different values of regularization techniques (e.g. dropout (Srivastava et al., 2014)) depending on the dataset size for each language pair. For most medium and low resource languages we also experiment with Transformer Base. All our models are trained with Adafactor (Shazeer and Stern, 2018) with momentum factorization, a learning rate schedule of (3.0, 40k),⁵ and a per-parameter norm clipping threshold of 1.0. For Transformer Base models, we use a learning rate schedule of (2.0, 8k), unless otherwise necessary for low-resource experiments.

In order to minimize confounding factors and control the evaluation set size and domain, we created our validation (development) and test sets as multi-way aligned datasets containing more than 3k and 5k sentence pairs respectively for all languages. For our bilingual baselines, BLEU scores are computed on the checkpoint with the best validation set performance, while we compute test BLEU on the final checkpoint (after training for around 1M steps) for our multilingual models, on true-cased output and references.⁶ For all our

⁴ Limited to approximately this amount for experimentation.
⁵ The (3.0, 40k) schedule is shorthand for a learning rate of 3.0, with 40k warm-up steps for the schedule, which is decayed with the inverse square root of the number of training steps after warm-up.
⁶ We used an in-house implementation of mteval-v13a.pl.
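As a concrete reading of footnote 5, the (3.0, 40k) schedule can be sketched as follows. The footnote only specifies the base rate, the warm-up length, and the inverse-square-root decay after warm-up, so the linear warm-up shape here is our assumption, not the paper's:

```python
import math

def learning_rate(step, base=3.0, warmup_steps=40000):
    """Sketch of a (base, warmup) schedule: ramp to `base` over the warm-up
    period (assumed linear), then decay with the inverse square root of the
    step count. Continuous at the warm-up boundary."""
    if step < warmup_steps:
        return base * step / warmup_steps  # assumed linear warm-up
    return base * math.sqrt(warmup_steps / step)  # rsqrt decay after warm-up
```

Under this reading, the rate reaches 3.0 exactly at step 40k and has halved by step 160k; the Transformer Base (2.0, 8k) schedule follows by changing the two arguments.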
En→Any      High 25   Med. 52   Low 25
Bilingual     29.34     17.50    11.72

Any→En      High 25   Med. 52   Low 25
Bilingual     37.61     31.41    21.63
transfer dynamics towards universal NMT.

4.1 Transfer and Interference

Multitask learning (Caruana, 1997) has been successfully applied to multiple domains, including NLP, speech processing, drug discovery and many others (Collobert and Weston, 2008; Deng et al., 2013; Ramsundar et al., 2015; Maurer et al., 2016; Ruder, 2017; Kaiser et al., 2017). Other problems closely related to multitask learning include zero or few-shot learning (Lake et al., 2011; Romera-Paredes and Torr, 2015; Vinyals et al., 2016; Pan and Yang, 2010), meta-learning and life-long learning (Thrun and Mitchell, 1995; Silver et al., 2013; Chen and Liu, 2016; Parisi et al., 2018; Golkar et al., 2019; Sodhani et al., 2018; Lopez-Paz et al., 2017). Although these learning paradigms make different assumptions about the underlying problem setting, they share the common goal of leveraging the inductive bias and regularization from a set of tasks to benefit another set of tasks. This inductive transfer could be parallel or sequential.

Multilingual NMT has been studied under many of these settings to various extents. Most existing literature deals with sequential transfer, where the goal is to leverage a set of high resource tasks that are already mastered, to improve the performance on a new (predominantly) low resource task (Zoph et al., 2016). We consider the parallel learning problem, where the goal is to learn a single multi-task model trained concurrently on all tasks and hence is capable of performing all tasks simultaneously once the training is accomplished.

In this section we investigate the effect of data imbalance across languages on these learning dynamics, particularly through the lens of transfer and interference (Caruana, 1997; Rosenstein et al., 2005). Reiterating from Section 2, two desired characteristics of our universal machine translation model are (1) maximum (positive) transfer to low-resource languages and (2) minimum interference (negative transfer) for high-resource languages. Now let us examine the interaction between the variables considered. For the baseline experiment, we start by following common conventions in literature on multilingual NMT (Firat et al., 2017; Lee et al., 2017; Johnson et al., 2017). We compare the performance of two training approaches against bilingual baselines following two strategies:

(i) all the available training data is combined as it is, with the data distribution in Figure 1,

(ii) we over-sample (up-sample) low-resource languages so that they appear with equal probability in the combined dataset.

In order to guide the translation with the intended target language, we pre-pend a target language token to every source sequence to be translated (Johnson et al., 2017). Further, to study the effect of transfer and interference at its limit, we shared a single encoder and decoder across all the language pairs. During training, mini-batches are formed by randomly sampling examples from the aggregated dataset following strategies (i) or (ii), as described above. We train a single Transformer-Big with a shared vocabulary of 64k tokens, all Transformer dropout options turned on with probability 0.1, and the same values as the bilingual baselines used for other hyper-parameters. We use batch sizes of 4M tokens for all our multilingual models to improve the rate of convergence. All Transformer-Big runs utilize data parallelism over 64 TPUv3 chips. The results are depicted in Figure 3.

The performance of these two models highlights the trade-off between transfer and interference. If we sample equally from all datasets by over-sampling low-resource languages (strategy (ii)), we maximize transfer (right-most portion of Figure 3) and beat our bilingual baselines by large margins, especially in the Any→En direction. However, this also has the side-effect of significantly deteriorated performance on high resource languages (left-most portion of Figure 3). On the other hand, sampling based on the true data distribution (strategy (i)) retains more performance on high resource languages, at the cost of sacrificing performance on low resource languages. We also note that the transfer-interference trade-off is more pronounced in the Any→En direction: the cost-benefit trade-off between low and high-resource languages is more severe than that for the En→Any direction.

Another interesting finding is the performance deterioration on high resource languages when hundreds of languages from different families are jointly considered.
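The target-language token mechanism described above is simple to make concrete. A minimal sketch, not the paper's code; the `<2xx>` token format follows the convention popularized by Johnson et al. (2017) and is our assumption:

```python
def add_target_token(source_sentence, target_lang):
    """Pre-pend a token identifying the desired target language, so a single
    shared encoder-decoder can be steered at inference time."""
    return f"<2{target_lang}> {source_sentence}"

# The same model can then be steered toward different target languages:
# add_target_token("How are you?", "de") -> "<2de> How are you?"
```

Because the token is just another vocabulary item, no architectural change is needed to support all 204 directions.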
Figure 3 caption (fragment): ...these directions is depicted in separate plots to highlight differences. Results are reported relative to those of the bilingual baselines (2). Performance on individual language pairs is reported using dots and a trailing average is used to show the trend. The colors correspond to the following sampling strategies: (i) Blue: original data distribution, (ii) Green: equal sampling from all language pairs. Best viewed in color.

...ing to language pair l being proportional to p_l^(1/T), where T is the sampling temperature. As a result, T = 1 corresponds to the true data distribution and T = 100 corresponds to an (almost) equal number of samples for each language (close to a uniform distribution with over-sampled low-resource languages). Please see Figure 4 for an illustration of the effect of temperature based sampling overlaid on our dataset distribution.
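The temperature-based sampling just described can be sketched in a few lines. A minimal illustration; the helper name and the example counts are ours, not from the paper:

```python
def sampling_probs(counts, T=1.0):
    """Probability of drawing each language pair, proportional to p_l^(1/T),
    where p_l is the empirical data distribution and T is the sampling
    temperature. T=1 recovers the true distribution; large T approaches
    a uniform distribution over language pairs."""
    total = sum(counts.values())
    weights = {l: (n / total) ** (1.0 / T) for l, n in counts.items()}
    z = sum(weights.values())
    return {l: w / z for l, w in weights.items()}
```

With hypothetical counts {"en-fr": 1,000,000, "en-yo": 100}, T = 1 leaves the low-resource pair nearly invisible, while T = 100 brings both pairs close to equal probability.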
En→Any      High 25   Med. 52   Low 25
Bilingual     29.34     17.50    11.72
T=1           28.63     15.11     6.24
T=5           28.03     16.91    12.75
T=100         27.20     16.84    12.87

Any→En      High 25   Med. 52   Low 25
Bilingual     37.61     31.41    21.63
T=1           34.60     27.46    18.14
T=5           33.85     30.25    26.96
T=100         33.25     30.13    27.32
learning strategies refine the ratio of tasks simultaneously with training, based on metrics derived from the current state of the learner (Graves et al., 2017). In the context of NMT, Kumar et al. (2019) learn an RL agent to schedule between different noise levels in the training data, while Platanios et al. (2019) combined heuristics into a data curriculum. Kiperwasser and Ballesteros (2018) designed schedules favoring one target language, and Jean et al. (2018) learn an adaptive scheduler for multilingual NMT. Similar approaches could be extended in the context of learning different language pairs in massively multilingual settings.
trained on all language pairs against two models: (i) an identical model trained on all En→Any tasks, the one-to-many setup, and (ii) an identical model trained on all Any→En tasks, the many-to-one setup. We depict the performance of these models in Figure 7 (and summarize it in Table 3). We notice that the many-to-one Any→En model achieves huge improvements over our bilingual baselines for all low-resource languages (right-most portion of Figure 7). On the other hand, for the one-to-many En→Any model, we notice lesser deterioration in the performance on high resource languages, while the performance on low resource languages does not improve by much.
This discrepancy between the transfer for Any→En and En→Any can be better understood under the characterization of the many-to-one Any→En model as a multi-domain model, where each source language constitutes a separate domain, and the one-to-many En→Any model as a multi-task model, with each target language representing a separate task (Section 2). This formulation of multilingual NMT helps explain the aforementioned observations, which suggest that multilingual models might be more amenable to transfer across input domains than transfer across tasks. Simple joint training does not do much to benefit one-to-many multilingual NMT; while some improvements may arise from the model seeing much more English source data, there is little to no transfer occurring at the task/decoder level. On the other hand, it is much easier for many-to-one multilingual models to reap the benefits of joint training when the inputs are from different domains: the output distribution remains the same, hence learning a much stronger English language model, without any interference from the other target languages or tasks. This is also reflected in other works on low-resource machine translation where Any→En typically benefits the most (Zoph et al., 2016; Nguyen and Chiang, 2017; Gu et al., 2018a,b).

Another strong indicator of transfer in multilingual models is the quality on zero-shot translation. Multilingual models possess a unique advantage over single task models, in that they are capable of translating between any pair of supported input and output languages, even when no direct parallel data is available (Firat et al., 2016b). However, without supervision from parallel data between non-English language pairs, zero-shot translation quality often trails the performance of pivoting/bridging through a common language (Johnson et al., 2017). Given the lack of transfer across different En→Any translation tasks, it isn't hard to imagine why transferring across Yy→En and En→Xx, to learn Yy→Xx, is an even more challenging task.

Enabling direct translation between arbitrary languages has been widely studied, and has the potential to obviate the need for two-step pivoting, which suffers from higher latency and accumulated errors. The most effective approach has been to simply synthesize parallel data (Firat et al., 2016b; Chen et al., 2017, 2018b) and incorporate that into the training process. However, this two-stage approach becomes intractable when dealing with a large number of languages; the amount of synthesized data required to enable zero-shot translation grows quadratically with the number of languages. More recent work has demonstrated that direct translation quality may be improved even in a zero-shot setting, by incorporating more languages (Aharoni et al., 2019), adding regularization to promote cross-lingual transfer (Arivazhagan et al., 2019; Gu et al., 2019; Al-Shedivat and Parikh, 2019), and modeling strategies that encourage shared cross-lingual representations (Lu et al., 2018; Escolano et al., 2019).

We report the zero-shot performance of our 10 language and 102 language models, discussed in Section 4.2, on selected language pairs in Table 4. We observe that zero-shot performance on similar languages, in this case Be→Ru and Yi→De, is extremely high. We also notice that the zero-shot performance for most language pairs increases as we move from the 10 language model to the 102 language model, possibly due to the regularization effect in the capacity constrained setting, similar to what was observed in (Aharoni et al., 2019). This also indicates why methods that explicitly force languages to share the same representation space (Arivazhagan et al., 2019; Gu et al., 2019; Lu et al., 2018) may be key to improving zero-shot translation performance.

            De→Fr   Be→Ru   Yi→De   Fr→Zh   Hi→Fi   Ru→Fi
10 langs    11.15   36.28    8.97   15.07    2.98    6.02
102 langs   14.24   50.26   20.00   11.83    8.76    9.06

Table 4: Effect of increasing the number of languages on the zero-shot performance of multilingual models.

We next delve into pre-processing and vocabulary generation when dealing with hundreds of languages.

5 Pre-processing and Vocabularies

Pre-processing and vocabulary construction are central to the generalization ability of natural language processing systems.

To generalize to unseen examples, machine learning algorithms must decompose the input data into a set of basic units (e.g. pixels, characters, phonemes etc.) or building blocks. For text, this set of basic units or building blocks is referred to as the vocabulary. Given some text and a vocabulary, a segmentation algorithm or a pre-processor is applied to fragment the text into its building blocks, upon which the learner may then apply inductive reasoning.

In essence, a properly defined vocabulary needs to (i) maintain a high coverage by being able to compose most text and produce a minimal number of Out-of-Vocabulary (OOV) tokens during segmentation, (ii) have a tractable size to limit computational and spatial costs, and (iii) operate at the right level of granularity to enable inductive transfer with manageable sequence lengths, which increase computational costs.

Early NMT models operated at the word level (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014). Coverage issues arising from the difficulty of capturing all words of a language within a limited vocabulary budget promoted the development of character level systems (Ling et al., 2015;
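The contrast between two-step pivoting and direct zero-shot translation can be made concrete with a schematic sketch. The `translate` callable here is hypothetical; it stands in for any model exposing English-centric directions:

```python
def pivot_translate(translate, text, src, tgt, pivot="en"):
    """Two-step pivoting: src -> pivot, then pivot -> tgt. This doubles
    decoding latency and lets first-step errors accumulate into the
    second step, which is why direct zero-shot translation is attractive."""
    return translate(translate(text, src, pivot), pivot, tgt)

# A multilingual model with a shared encoder-decoder can instead attempt
# translate(text, src, tgt) directly (zero-shot), even when no src-tgt
# parallel data was seen during training.
```

The quadratic blow-up mentioned above follows directly: synthesizing data for every non-English pair requires on the order of N² corpora for N languages, whereas pivoting and zero-shot reuse the 2N English-centric directions.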
Luong and Manning, 2016; Chung et al., 2016; Costa-jussà and Fonollosa, 2016; Lee et al., 2017; Cherry et al., 2018). These trivially achieve high coverage, albeit with the downside of increased computational and modeling challenges due to increased sequence lengths. Sub-word level vocabularies (Sennrich et al., 2016) have since found a middle ground and are used in most state-of-the-art NMT systems.
Constructing vocabularies that can model hundreds of languages with a vast number of character sets, compositional units, and morphological variance is critical to developing massively multilingual systems, yet remains challenging. Early multilingual NMT models utilized separate vocabularies for each language (Dong et al., 2015; Luong et al., 2015; Firat et al., 2016a); later ones used multilingual vocabularies shared across languages (Sennrich et al., 2016; Ha et al., 2016c; Johnson et al., 2017). Recently, hybrid (Shared + Separate) multilingual vocabularies and approaches for adding new languages to existing vocabularies (Lakew et al., 2018) have also been explored.

In this section we describe the simplistic approach to multilingual vocabulary construction used in our setup, inspect implicit characteristics of such vocabularies, and finally evaluate the downstream effects on multilingual machine translation.

Figure 8: Average number of sentence-piece tokens per sentence on random samples drawn from the training set. We compare the increase in number of tokens per sentence for different languages, when moving from a standard monolingual vocabulary with 32k tokens to a multilingual vocabulary for which we vary the vocabulary size (size={32k, 64k, 128k} tokens) and the vocabulary sampling temperature (T = {1, 5, 100}).
5.1 Vocabulary Construction and Characteristics

We construct all our vocabularies using Sentence Piece Model (SPM) (Kudo and Richardson, 2018) to remove complexity that may arise from language specific pre-processing (e.g. tokenization, special character replacements etc.). The large number of languages used in our setup makes separate per-language vocabularies infeasible. Therefore, we picked shared sub-word vocabularies. While using a single shared vocabulary across all languages reduces complexity, it introduces other challenges that we discuss below.

With the large number of scripts introduced in a multilingual setting, the chance of a vocabulary producing unknowns (OOV) increases. Note that since only unrecognized characters will be encoded as OOV, if the vocabulary provides sufficient coverage over the alphabet from various scripts, OOV rates will be low. For SPMs, this is tuned using the character_coverage option, and we set it to a high value of 1 − (5 × 10⁻⁶), which yields an alphabet size of around 12000, ensuring very low unknown rates for all the languages in our study.

While ensuring low unknown rates, the shift towards a vocabulary which largely consists of characters or short sub-words (in terms of the number of characters that they consist of) results in longer sequences after tokenization (Chung et al., 2016; Costa-jussà and Fonollosa, 2016). Longer sequence lengths, however, increase computational complexity and may also introduce optimization challenges due to longer range dependencies, and require the learner to model a more complex mapping function from a finer grained sequence to meaning. To avoid exceedingly long sequence
En→Any      High 25   Med. 52   Low 25
32k Vocab     27.69     16.84    12.90
64k Vocab     28.03     16.91    12.75

Any→En      High 25   Med. 52   Low 25
32k Vocab     33.24     29.40    26.18
64k Vocab     33.85     30.25    26.96

Table 5: Average translation quality (BLEU) of multilingual models using different SPM vocabulary sizes, over different groups of languages. High 25 refers to the top 25 languages by dataset size, while low 25 refers to the bottom 25.

En→Any      High 25   Med. 52   Low 25
TV = 1        27.81     16.72    12.73
TV = 5        28.03     16.91    12.75
TV = 100      27.83     16.86    12.78

Any→En      High 25   Med. 52   Low 25
TV = 1        33.82     29.78    26.27
TV = 5        33.85     30.25    26.96
TV = 100      33.70     30.15    26.91

Table 6: Average translation quality (BLEU) of multilingual models using different sampling temperatures for vocabulary generation. High 25 refers to the top 25 languages by dataset size, while low 25 refers to the bottom 25.
lengths we need to increase the vocabulary size, so that it may include longer tokens (sub-sequences that consist of a larger number of characters).

Finally, a shared multilingual vocabulary runs the risk of favoring some languages over others, due to the imbalance of the dataset sizes the vocabulary is extracted from. To reduce the effect of imbalanced dataset sizes we apply the same temperature sampling strategy discussed in Section 4 to vocabulary sampling.

In Figure 8 we report the effect of varying the vocabulary size (size={32k, 64k, 128k} tokens) and sampling temperature (T = {1, 5, 100}) on the average number of tokens per sentence on 10 indicative languages. We see that it is important to balance different languages by using a higher sampling temperature of T = 5 or T = 100, in order to avoid exceedingly long sequence lengths for low resource languages. Orthogonally, increasing the vocabulary size also helps to reduce sequence lengths, especially on low resource languages. Here we continue experiments with TV = 5 to stay aligned with our sampling approach during NMT training.

We next train and evaluate a few multilingual models to study the effect of different vocabulary sizes and sampling strategies on overall translation quality. Following the experiments in Section 4, we train single multilingual models on all language pairs, using a data sampling temperature of T = 5. All our models are Transformer-Big, trained with the same set of hyper-parameters, different only in terms of the vocabulary used.

Table 5 compares the quality of two models trained using vocabularies of size 32k and 64k. We notice that the model with the smaller 32k token vocab does noticeably worse on high resource languages when translating in both directions, and on Any→En translation in all resource settings. On the other hand, the smaller vocab model performs marginally better when translating into low resource languages on En→Any. For other medium resource languages, increased vocabulary size appears to be better on all directions. Our results here agree with existing literature (Cherry et al., 2018; Kreutzer and Sokolov, 2018) suggesting that using smaller sub-word tokens, as is the case for smaller vocabularies, performs better in low resource settings due to improved generalization. Notable languages where the smaller vocabulary performs better include Corsican (co) and Uzbek (uz), both low resource languages which have known similar high resource languages to aid with generalization.

We compare the translation quality of models that vary only in their vocabulary sampling temperature in Table 6. While not very pronounced, we do notice some differences in quality based on the vocabulary sampling temperature. Languages that perform better with a higher temperature of TV = 5 or TV = 100 include low-medium resource languages like Mongolian (mn), Zulu (zu), Corsican (co) and Uzbek (uz). These gains may have originated from two potential factors: (i) smaller tokens for high resource languages result in better transfer to related low resource languages; (ii) better character coverage for languages with distinct character sets.
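The vocabulary construction recipe in Section 5.1 (temperature-sample the corpora, then train a shared SPM) can be sketched as follows. The helper, the budget, and the example counts are ours; only the character_coverage value of 1 − 5×10⁻⁶ comes from the text:

```python
def vocab_sample_sizes(counts, total_budget, T=5.0):
    """Number of sentences to draw from each language's corpus when building
    the shared SPM training corpus: temperature-sampled (as in Section 4) so
    that low-resource languages are fairly represented."""
    grand = sum(counts.values())
    weights = {l: (n / grand) ** (1.0 / T) for l, n in counts.items()}
    z = sum(weights.values())
    return {l: round(total_budget * w / z) for l, w in weights.items()}

# The sampled corpus would then feed SentencePiece training, e.g. (requires
# the `sentencepiece` package; paths and flags other than character_coverage
# are illustrative):
# spm.SentencePieceTrainer.train(input="sampled_corpus.txt",
#     model_prefix="multilingual", vocab_size=64000,
#     character_coverage=1 - 5e-6)
```

At TV = 1 each language contributes in proportion to its data; at TV = 5 a language with 10% of the data contributes well over a third of a two-language budget, which is exactly the rebalancing effect discussed above.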
While the effects of vocabulary are much smaller than the trends observed for data sampling, failure to ensure careful character coverage and fair representation from all languages could nonetheless significantly impact translation quality.

So far in our study, we have presented our analysis and experimentation with data, training and vocabulary in multilingual scenarios. We next analyze the effect of several architectural choices on the quality of our massively multilingual NMT model.

6 Modeling

The quality of any neural network is largely dependent on its architecture and parametrization. The choice of the model, and the constraints on its parameters, determine the extent of the transfer-interference trade-off, the learning dynamics and, ultimately, the performance limits of the model. In this section we analyze how choices regarding parameter sharing and model capacity can impact translation quality in the setting of massively multilingual NMT.

6.1 Architecture

In recent years, several revolutionary architectures have been developed to improve MT quality (Gehring et al., 2017; Vaswani et al., 2017; Chen et al., 2018a; Wu et al., 2019). However, in the context of multilingual NMT, the most common prior imposed on models is typically in the form of (hard or soft) parameter sharing constraints across different language pairs. In designing a multilingual NMT model, we would like to take advantage of the common structures and features shared by multiple languages, which imposes constraints on the model architecture. These constraints range from sharing a fixed length representation across all source and target pairs (Luong et al., 2015), sharing small network modules, for example the attention parameters (Firat et al., 2016a), and sharing everything except the attention parameters (Ha et al., 2016b; Blackwood et al., 2018), to sharing all parameters in the model across all language pairs (Johnson et al., 2017; Ha et al., 2016c). Some studies also explore partial parameter sharing strategies tailored for specific architectures (Sachan and Neubig, 2018; Wang et al., 2018b). More recently, with the increased number of languages (Qi et al., 2018; Aharoni et al., 2019), there has been work along the lines of applying soft sharing for MT, in the form of a shared set of meta-learned parameters used to generate task level parameters (Platanios et al., 2018). All of these sharing strategies come paired with their associated set of transfer-interference trade-offs, and regardless of the sharing strategy, in the absence of a complete and accurate atlas of task relatedness, more transferability also implies more interference. This phenomenon is known to be a common problem for multitask learning, and is also related to the stability vs plasticity dilemma (Carpenter and Grossberg, 1987; Grossberg, 1988). See (Parisi et al., 2018) for a categorization of such work.

Considering the scale of our experimental setup and end-goal, combinatorial approaches for parameter sharing do not emerge as plausible solutions. While "learning to share" approaches (Kang et al., 2011; Ruder et al., 2017; Platanios et al., 2018) are promising alternatives, a more straightforward solution could be implicitly increasing the per-task capacity by increasing overall model capacity. Next we look into a brute-force approach to mitigate interference by enhancing model capacity.

6.2 Capacity

Over the last few years, scaling up model capacity has been shown to demonstrate huge improvements on several supervised and transfer learning benchmarks, including MT (Brock et al., 2018; Devlin et al., 2018; Radford et al.; Shazeer et al., 2018). Scale often comes bundled with new hardware, infrastructure, and algorithmic approaches meant to optimize accelerator memory utilization and benefit faster computation, including methods like gradient checkpointing and low precision training (Courbariaux et al., 2014; Ott et al., 2018), memory efficient optimizers (Shazeer and Stern, 2018; Gupta et al., 2018) and frameworks supporting model parallelism (Shazeer et al., 2018; Harlap et al.; Huang et al., 2018).

While massive models have been shown to improve performance in single task settings (Shazeer et al., 2018), we would expect the gains to be even larger on a capacity-constrained massively multi-
En→Any      High 25   Med. 52   Low 25
Bilingual     29.34     17.50    11.72
400M          28.03     16.91    12.75
1.3B Wide     28.36     16.66    11.14
1.3B Deep     29.46     17.67    12.52

Any→En      High 25   Med. 52   Low 25
Bilingual     37.61     31.41    21.63
400M          33.85     30.25    26.96
1.3B Wide     37.13     33.21    27.75
1.3B Deep     37.47     34.63    31.21
be one of the most important factors determin- both effective and comparable across languages.
ing the extent of the transfer-interference trade- This in and of itself is a hard problem with an
off. However, naively scaling capacity might ever growing list of hundreds of metrics to choose
result in poor transfer performance to low re- from. Oftentimes these metrics vary in their effec-
source languages. For example, our wide Transformer, while significantly improving performance on the high resource languages, fails to show similar gains in the low resource setting. While deeper models show great performance improvements, they come bundled with high decoding latency, a significantly larger computational footprint, and trainability concerns including vanishing/exploding gradients, early divergence, ill-conditioned initial conditions, etc. (Hestness et al., 2017). Further research into various aspects of scalability, trainability and optimization dynamics is expected to be a fruitful avenue towards universal NMT.

We next delve into the evaluation challenges posed by multilingual models.

7 Evaluation

Metrics for automatic quality evaluation (Papineni et al., 2002) have been critical to the rapid progress of machine translation, by making evaluation fast, cheap and reproducible. For multilingual NMT, new concerns arise due to the multi-objective nature of the problem and the inherent quality trade-offs between languages.

Inter-language quality trade-offs arise from various decisions made while constructing, training and selecting a model. When constructing the model, the vocabulary may be built to favor a certain script or group of languages, or the language-specific parameters may be unevenly distributed across languages. During training, the optimization settings or the rate at which training data from different languages are sampled strongly influence the eventual performance on different languages. Finally, when selecting a checkpoint (i.e., a particular snapshot of the model parameters), the model may perform better on high resource languages at later checkpoints but may have regressed on low resource languages by that time due to over-training (or under-training, in the opposite case). Each of these choices naturally may favor certain languages over others.

To choose between the aforementioned trade-offs, we need a translation quality metric that is both effective and comparable across languages. However, metrics vary in their effectiveness across languages: WMT shared tasks (Ma et al., 2018b) report that the specific language, dataset, and system significantly affect which metric has the strongest correlation with human ratings. And even when metrics are sufficiently effective across languages, they are not always comparable. N-gram based metrics (Papineni et al., 2002; Doddington, 2002; Wang et al., 2016; Popović, 2015) that measure lexical overlap require tokenization, which is highly affected by language-specific factors such as alphabet size and morphological complexity. In fact, even within the same language, tokenization discrepancies pose significant challenges to reliable evaluation (Post, 2018). Embedding-based approaches (Stanojevic and Sima'an, 2014) may be language agnostic and help address these issues.

Equally important to choosing a metric is choosing an evaluation set. Most existing metrics are not consistent (Banerjee and Lavie, 2005): for the same model, they vary based on the domain or even the specific sentences they are evaluated on. For example, there are significant differences of 3-5 BLEU between the WMT dev and test sets for the same language pair. Such consistency issues may further exacerbate the difficulty of comparing system quality across languages if we use different test sets for each language. This may be addressed by ensuring that evaluation is performed on the same corpora that have been translated into all the languages, i.e. multi-way parallel data. Even so, attention must be paid to the original language (Freitag et al., 2019) and the domain such data is collected from.

8 Open Problems in Massively Multilingual NMT

Data and Supervision: Whereas we focus solely on supervised learning, for many low resource languages it becomes essential to learn from monolingual data. There has been a lot of recent work on incorporating monolingual data to improve translation performance in low and zero resource settings, including research on back-translation (Sennrich et al., 2015; Edunov
et al., 2018), language model fusion (Gulcehre et al., 2015; Sriram et al., 2017), self-supervised pre-training (Dai and Le, 2015; Ramachandran et al., 2016; Zhang and Zong, 2016; Song et al., 2019) and unsupervised NMT (Lample et al., 2017; Artetxe et al., 2017). Languages where large swathes of monolingual data are not easily available might require more sample-efficient approaches for language modeling, including grounding with visual modalities or sources of information (Huang et al., 2016), and learning from meta-data or other forms of context. Being able to represent and ground information from multiple modalities in the same representational space is the next frontier for ML research.

The scope of our study is limited to 103 languages, a minuscule fraction of the thousands of existing languages. The heuristics and approaches we develop will be less and less applicable as we include more languages and incorporate other forms of data. As we scale up model sizes and the number of languages learned within a single model, approaches that require multi-stage training or inference steps will likely become infeasible. Work towards better integration of self-supervised learning with the supervised training process is more likely to scale well with larger model capacities and increasing numbers of languages.

Learning: Developing learning approaches that work well for multitask models is essential to improving the quality of multilingual models. Our analysis from Section 4 demonstrates that even simple heuristics for data sampling/balancing can significantly improve the extent of transfer and interference observed by individual tasks. However, our heuristic strategy only takes dataset size into account when determining the fraction of per-task samples seen by the model. Research on exploiting task relatedness (Lee et al., 2016; Neubig and Hu, 2018), curriculum learning on noisy data (van der Wees et al., 2017; Wang et al., 2018a), and automated curriculum learning from model state (Bengio et al., 2009; Graves et al., 2017) has demonstrated success in multitask learning, including for NMT. Other relevant threads of work include research on meta-learning of model hyper-parameters (Nichol et al., 2018; Baydin et al., 2017) and model parameters (Ha et al., 2016a; Platanios et al., 2018), and on models that learn new tasks with high sample efficiency (Finn et al., 2017) without forgetting existing tasks or languages (Rusu et al., 2016; Kirkpatrick et al., 2017; Lakew et al., 2018). When scaling to thousands of languages, approaches that can automatically learn data sampling, curricula, and model hyper-parameters and parameters, in order to train models that quickly adapt to new tasks, are again expected to become increasingly important.

Increasing Capacity: Increasing model capacity has been demonstrated to be a sure-shot approach to improving model quality in the presence of supervised data for several tasks, including image generation (Brock et al., 2018), language modeling and transfer learning (Radford et al.; Devlin et al., 2018), and NMT (Shazeer et al., 2018). Validating this trend, one key result from our study is the need for sufficient model capacity when training large multitask networks. Beyond the systems and engineering challenges (Jouppi et al., 2017; Huang et al., 2018; Shazeer et al., 2018), training deep and high-capacity neural networks poses significant trainability challenges (Montavon et al., 2018). A better theoretical understanding of the generalization ability of deep and wide models (Raghu et al., 2017; Arora et al., 2018), of trainability challenges including exploding and vanishing gradients (Glorot and Bengio, 2010; Balduzzi et al., 2017), and of empirical approaches to counter these challenges (Bapna et al., 2018; Zhang et al., 2019) is critical to further scaling of model capacity.

Architecture and Vocabulary: Extending existing approaches for vocabulary construction and neural modeling to work better in multitask settings (Ma et al., 2018a; Houlsby et al., 2019), in order to strike the right balance between shared and task-specific capacity, and learning network structure in order to maximally exploit task relatedness (Li et al., 2019), are exciting avenues for future research. As we scale up to thousands of languages, vocabulary handling becomes a significantly harder challenge. Successfully representing thousands of languages might require character (Lee et al., 2017; Cherry et al., 2018), byte (Gillick et al., 2015) or even bit level modeling. Modeling extremely long sequences of bit/byte-
level tokens would present its own set of modeling challenges. Finally, while scaling up existing approaches is one way to improve model quality, some approaches or architectures are more efficient to train (Shazeer et al., 2017; Vaswani et al., 2017; Wu et al., 2019), more sample efficient, or faster at inference (Gu et al., 2017; Roy et al., 2018). As models get larger, improving training and inference efficiency becomes more important to keep training and inference times within reasonable limits, from both a practical and an environmental perspective (Strubell et al., 2019).

in zero-shot neural machine translation. arXiv preprint arXiv:1903.07091.

Sanjeev Arora, Nadav Cohen, and Elad Hazan. 2018. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.
Graeme Blackwood, Miguel Ballesteros, and Todd Ward. 2018. Multilingual neural machine translation with task-specific attention. arXiv preprint arXiv:1806.03280.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, et al. 2018a. Proceedings of the third conference on machine translation: Research papers. Belgium, Brussels. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, et al. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, et al. 2016a. Findings of the 2016 conference on machine translation. In ACL 2016 First Conference on Machine Translation (WMT16), pages 131–198. The Association for Computational Linguistics.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018b. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Belgium, Brussels. Association for Computational Linguistics.

Ondrej Bojar, Christian Federmann, Barry Haddow, Philipp Koehn, Matt Post, and Lucia Specia. 2016b. Ten years of WMT evaluation campaigns: Lessons learnt. In Proceedings of the LREC 2016 Workshop "Translation Evaluation – From Fragmented Tools and Data Sets to an Integrated Ecosystem", pages 27–34.

Andrew Brock, Jeff Donahue, and Karen Simonyan. 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.

Gail A Carpenter and Stephen Grossberg. 1987. ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(23):4919–4930.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Mauro Cettolo, C Girardi, and M Federico. 2012. WIT3: Web inventory of transcribed and translated talks. Proceedings of EAMT, pages 261–268.

Mia Xu Chen, Orhan Firat, Ankur Bapna, et al. 2018a. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86, Melbourne, Australia. Association for Computational Linguistics.

Yun Chen, Yang Liu, Yong Cheng, and Victor OK Li. 2017. A teacher-student framework for zero-resource neural machine translation. arXiv preprint arXiv:1705.00753.

Yun Chen, Yang Liu, and Victor OK Li. 2018b. Zero-resource neural machine translation with multi-agent communication game. In Thirty-Second AAAI Conference on Artificial Intelligence.

Zhiyuan Chen and Bing Liu. 2016. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(3):1–145.

Colin Cherry, George Foster, Ankur Bapna, Orhan Firat, and Wolfgang Macherey. 2018. Revisiting character-based neural machine translation with capacity and compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4295–4305, Brussels, Belgium. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167.

Marta R. Costa-jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. CoRR, abs/1603.00810.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2014. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024.

Josep Maria Crego, Jungi Kim, Guillaume Klein, et al. 2016. Systran's pure neural machine translation systems. CoRR, abs/1610.05540.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

L. Deng, G. Hinton, and B. Kingsbury. 2013. New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pages 138–145. Morgan Kaufmann Publishers Inc.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1723–1732.

Mark Dredze, Alex Kulesza, and Koby Crammer. 2010. Multi-domain learning by confidence-weighted parameter combination. Machine Learning, 79(1-2):123–149.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.

Carlos Escolano, Marta R Costa-jussà, and José AR Fonollosa. 2019. Towards interlingua neural machine translation. arXiv preprint arXiv:1905.06831.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073.

Orhan Firat, Kyunghyun Cho, Baskaran Sankaran, Fatos T. Yarman Vural, and Yoshua Bengio. 2017. Multi-way, multilingual neural machine translation. Computer Speech & Language, 45:236–252.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277.

Markus Freitag, Isaac Caswell, and Scott Roy. 2019. Text repair model for neural machine translation. arXiv preprint arXiv:1904.04790.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. CoRR, abs/1705.03122.

Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.

Siavash Golkar, Michael Kagan, and Kyunghyun Cho. 2019. Continual learning via neural pruning. CoRR, abs/1903.04476.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.

Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. 2017. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1311–1320. JMLR.org.

Stephen Grossberg. 1988. Neurocomputing: Foundations of research. Chapter: How Does the Brain Build a Cognitive Code?, pages 347–399. MIT Press, Cambridge, MA, USA.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O.K. Li. 2018a. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, New Orleans, Louisiana. Association for Computational Linguistics.

Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. 2018b. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437.

Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor OK Li. 2019. Improved zero-shot neural machine translation via ignoring spurious correlations. arXiv preprint arXiv:1906.01181.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Vineet Gupta, Tomer Koren, and Yoram Singer. 2018. Shampoo: Preconditioned stochastic tensor optimization. arXiv preprint arXiv:1802.09568.

David Ha, Andrew Dai, and Quoc V Le. 2016a. Hypernetworks. arXiv preprint arXiv:1609.09106.

Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016b. Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798.

Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016c. Toward multilingual neural machine translation with universal encoder and decoder. CoRR, abs/1611.04798.

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Gregory R Ganger, and Phillip B Gibbons. PipeDream: Pipeline parallelism for DNN training.

Hany Hassan, Anthony Aue, Chang Chen, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 2017. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.

Chris Hokamp, John Glover, and Demian Gholipour. 2019. Evaluating the supervised and zero-shot performance of multilingual translation models. arXiv preprint arXiv:1906.09675.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. arXiv preprint arXiv:1902.00751.

Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh, and Chris Dyer. 2016. Attention-based multimodal neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, volume 2, pages 639–645.

Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, and Zhifeng Chen. 2018. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965.

Sébastien Jean, Orhan Firat, and Melvin Johnson. 2018. Adaptive scheduling for multi-task learning. In Advances in Neural Information Processing Systems, Workshop on Continual Learning, Montreal, Canada.

Melvin Johnson, Mike Schuster, Quoc V Le, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association of Computational Linguistics, 5(1):339–351.

Mahesh Joshi, Mark Dredze, William W. Cohen, and Carolyn Rose. 2012. Multi-domain learning: When do domains matter? In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1302–1312, Jeju Island, Korea. Association for Computational Linguistics.

Norman P. Jouppi, Cliff Young, Nishant Patil, et al. 2017. In-datacenter performance analysis of a tensor processing unit.

Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One model to learn them all. CoRR, abs/1706.05137.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.

Zhuoliang Kang, Kristen Grauman, and Fei Sha. 2011. Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pages 521–528, USA. Omnipress.

Eliyahu Kiperwasser and Miguel Ballesteros. 2018. Scheduled multi-task learning: From syntax to translation. Transactions of the Association for Computational Linguistics, 6:225–240.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Julia Kreutzer and Artem Sokolov. 2018. Learning to segment inputs for NMT favors character-level processing. CoRR, abs/1810.01480.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Gaurav Kumar, George Foster, Colin Cherry, and Maxim Krikun. 2019. Reinforcement learning based curriculum optimization for neural machine translation. CoRR, abs/1903.00041.

Brenden M. Lake, Ruslan R. Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. 2011. One shot learning of simple visual concepts. In CogSci.
Surafel Melaku Lakew, Aliia Erofeeva, Matteo Negri, Marcello Federico, and Marco Turchi. 2018. Transfer learning in multilingual neural machine translation with dynamic vocabulary. CoRR, abs/1811.01137.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Giwoong Lee, Eunho Yang, and Sung Hwang. 2016. Asymmetric multi-task learning based on task relatedness and loss. In International Conference on Machine Learning, pages 230–238.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.

Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. 2019. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting.

Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W Black. 2015. Character-based neural machine translation. arXiv preprint arXiv:1511.04586.

David Lopez-Paz et al. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476.

Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. A neural interlingua for multilingual machine translation. arXiv preprint arXiv:1804.08198.

Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.

Minh-Thang Luong and Christopher D Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788.

Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi. 2018a. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pages 1930–1939, New York, NY, USA. ACM.

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018b. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671–688.

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. 2016. The benefit of multitask representation learning. Journal of Machine Learning Research, 17:81:1–81:32.

Tom M. Mitchell. 1997. Machine Learning. McGraw Hill series in computer science. McGraw-Hill.

Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2018. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15.

Hyeonseob Nam and Bohyung Han. 2015. Learning multi-domain convolutional neural networks for visual tracking. CoRR, abs/1510.07945.

Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880. Association for Computational Linguistics.

Toan Q. Nguyen and David Chiang. 2017. Transfer learning across low-resource, related languages for neural machine translation. In Proc. IJCNLP, volume 2, pages 296–301.

Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. arXiv preprint arXiv:1806.00187.

S. J. Pan and Q. Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kann, and Stefan Wermter. 2018. Continual lifelong learning with neural networks: A review. CoRR, abs/1802.07569.

Anastasia Pentina, Viktoriia Sharmanska, and Christoph H. Lampert. 2015. Curriculum learning of multiple tasks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

A. Phillips and M Davis. 2009. Tags for Identifying Languages. RFC 5646, RFC Editor.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. arXiv preprint arXiv:1808.08493.

Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabás Póczos, and Tom M. Mitchell. 2019. Competence-based curriculum learning for neural machine translation. CoRR, abs/1903.09848.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395.

Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.

Ye Qi, Sachan Devendra, Felix Matthieu, Padmanabhan Sarguna, and Neubig Graham. 2018. When and why are pre-trained word embeddings useful for neural machine translation. In HLT-NAACL.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. 2017. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2847–2854. JMLR.org.

Prajit Ramachandran, Peter J Liu, and Quoc V Le. 2016. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683.

Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. 2015. Massively multitask networks for drug discovery. arXiv:1502.02072.

Bernardino Romera-Paredes and Philip H. S. Torr. 2015. An embarrassingly simple approach to zero-shot learning. In ICML.

Michael T Rosenstein, Zvika Marx, Leslie Pack Kaelbling, and Thomas G Dietterich. 2005. To transfer or not to transfer. In NIPS 2005 Workshop on Transfer Learning, volume 898, pages 1–4.

Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. 2018. Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. CoRR, abs/1706.05098.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2017. Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.

Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.
Devendra Singh Sachan and Graham Neubig. 2018. Parameter sharing methods for multilingual self-attentional translation models. Proceedings of the Third Conference on Machine Translation.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1715–1725.

Noam Shazeer, Youlong Cheng, Niki Parmar, et al. 2018. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414–10423.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235.

Jonathan Shen, Patrick Nguyen, Yonghui Wu, et al. 2019. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295.

Daniel L. Silver, Qiang Yang, and Lianghao Li. 2013. Lifelong machine learning systems: Beyond learning algorithms. In AAAI Spring Symposium: Lifelong Machine Learning.

Shagun Sodhani, Sarath Chandar, and Yoshua Bengio. 2018. On training recurrent neural networks for lifelong learning. CoRR, abs/1811.07017.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.

Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2017. Cold fusion: Training seq2seq models together with language models. arXiv preprint arXiv:1708.06426.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Milos Stanojevic and Khalil Sima'an. 2014. BEER: BEtter evaluation as ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 414–419, Baltimore, Maryland, USA. Association for Computational Linguistics.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461.

Sebastian Thrun and Tom M. Mitchell. 1995. Lifelong robot learning. Robotics and Autonomous Systems, 15(1):25–46. The Biology and Technology of Intelligent Autonomous Agents.

Sebastian Thrun and Lorien Pratt, editors. 1998. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
25
Jörg Tiedemann. 2018. Emerging language Marlies van der Wees, Arianna Bisazza, and
spaces learned from massively multilingual cor- Christof Monz. 2017. Dynamic data selection
pora. arXiv preprint arXiv:1802.00273. for neural machine translation. arXiv preprint
arXiv:1708.00712.
Jakob Uszkoreit, Jay M Ponte, Ashok C Popat,
and Moshe Dubiner. 2010. Large scale par- Felix Wu, Angela Fan, Alexei Baevski, Yann N
allel document mining for machine transla- Dauphin, and Michael Auli. 2019. Pay less at-
tion. In Proceedings of the 23rd Interna- tention with lightweight and dynamic convolu-
tional Conference on Computational Linguis- tions. arXiv preprint arXiv:1901.10430.
tics, pages 1101–1109. Association for Compu-
tational Linguistics. Yonghui Wu, Mike Schuster, Zhifeng Chen,
et al. 2016. Google’s neural machine trans-
Ashish Vaswani, Noam Shazeer, Niki Parmar, lation system: Bridging the gap between hu-
Jakob Uszkoreit, Llion Jones, Aidan N Gomez, man and machine translation. arXiv preprint
Łukasz Kaiser, and Illia Polosukhin. 2017. At- arXiv:1609.08144.
tention is all you need. In Advances in Neural
Information Processing Systems, pages 5998– Hongyi Zhang, Yann N Dauphin, and Tengyu
6008. Ma. 2019. Fixup initialization: Residual learn-
ing without normalization. arXiv preprint
Ricardo Vilalta and Youssef Drissi. 2002. A per- arXiv:1901.09321.
spective view and survey of meta-learning. Ar-
tificial intelligence review, 18(2):77–95. Jiajun Zhang and Chengqing Zong. 2016. Ex-
ploiting source-side monolingual data in neu-
Oriol Vinyals, Charles Blundell, Timothy Lilli- ral machine translation. In Proceedings of the
crap, Koray Kavukcuoglu, and Daan Wierstra. 2016 Conference on Empirical Methods in Nat-
2016. Matching networks for one shot learning. ural Language Processing, pages 1535–1545.
In Proceedings of the 30th International Con-
ference on Neural Information Processing Sys- Yu Zhang and Qiang Yang. 2017. A sur-
tems, NIPS’16, pages 3637–3645, USA. Curran vey on multi-task learning. arXiv preprint
Associates Inc. arXiv:1707.08114.
Wei Wang, Taro Watanabe, Macduff Hughes, Tet- Jie Zhou, Ying Cao, Xuguang Wang, Peng Li,
suji Nakagawa, and Ciprian Chelba. 2018a. and Wei Xu. 2016. Deep recurrent models
Denoising neural machine translation training with fast-forward connections for neural ma-
with trusted data and online data selection. chine translation. Transactions of the Associa-
arXiv preprint arXiv:1809.00068. tion for Computational Linguistics, 4:371–383.
Language Id Language Id Language Id
Afrikaans af Hebrew iw Polish pl
Albanian sq Hindi hi Portuguese pt
Amharic am Hmong hmn Punjabi pa
Arabic ar Hungarian hu Romanian ro
Armenian hy Icelandic is Russian ru
Azerbaijani az Igbo ig Samoan sm
Basque eu Indonesian id Scots Gaelic gd
Belarusian be Irish ga Serbian sr
Bengali bn Italian it Sesotho st
Bosnian bs Japanese ja Shona sn
Bulgarian bg Javanese jw Sindhi sd
Burmese my Kannada kn Sinhalese si
Catalan ca Kazakh kk Slovak sk
Cebuano ceb Khmer km Slovenian sl
Chinese zh Korean ko Somali so
Corsican co Kurdish ku Spanish es
Croatian hr Kyrgyz ky Sundanese su
Czech cs Lao lo Swahili sw
Danish da Latvian lv Swedish sv
Dutch nl Lithuanian lt Tajik tg
Esperanto eo Luxembourgish lb Tamil ta
Estonian et Macedonian mk Telugu te
Filipino/Tagalog tl Malagasy mg Thai th
Finnish fi Malay ms Turkish tr
French fr Malayalam ml Ukrainian uk
Frisian fy Maltese mt Urdu ur
Galician gl Maori mi Uzbek uz
Georgian ka Marathi mr Vietnamese vi
German de Mongolian mn Welsh cy
Greek el Nepali ne Xhosa xh
Gujarati gu Norwegian no Yiddish yi
Haitian Creole ht Nyanja ny Yoruba yo
Hausa ha Pashto ps Zulu zu
Hawaiian haw Persian fa
Table 8: List of BCP-47 language codes used throughout this paper (Phillips and Davis, 2009).