Abstract

Neural networks are known to be data hungry and domain sensitive, but it is nearly impossible … but one drawback is that the statistics of different domains or tasks are inherently divergent, and smoothing over these differences can lead …
… to compensate for domain differences and adapt an NMT model trained on out-of-domain parallel data. Although we mainly examine NMT in this paper, the general idea can be applied to other tasks as well.

We evaluate DDA in two different unsupervised domain adaptation settings on four language pairs. DDA demonstrates consistent improvements of up to 4 BLEU points over an unadapted NMT baseline, and up to 2 BLEU points over an NMT baseline adapted using existing methods. An analysis reveals that DDA significantly improves the NMT model's ability to generate words that are seen more frequently in in-domain data, indicating that DDA is a promising approach to domain adaptation of NMT and neural models in general.

… t, and then compute the conditional distribution p(y_t | y_{<t}, x) using a softmax layer. Both the encoder and the decoder are jointly trained to maximize the log-likelihood of the parallel training corpus (X, Y):

\max_{\theta_{NMT}} \sum_{(x_i, y_i) \in (X, Y)} \log p(y_i \mid x_i; \theta_{NMT}).

During decoding, NMT models generate words one by one. Specifically, at each time step t, the NMT model calculates the probability of the next word, p_NMT(y_t | y_{<t}, x), for each of the previous hypotheses {y^{(i)}_{<=t-1}}. After appending the new word to a previous hypothesis, new scores are calculated and the top K are selected as the new hypotheses {y^{(i)}_{<=t}}.
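For concreteness, one expansion step of this beam search can be sketched as follows. This is a minimal illustration rather than the paper's actual decoder; the callable `next_word_log_probs` is a hypothetical stand-in for the NMT model's softmax over the vocabulary (the source sentence x is implicit in it).

```python
from typing import Callable, List, Tuple

def beam_search_step(
    hypotheses: List[Tuple[List[int], float]],                 # (partial translation y_<t, log-prob score)
    next_word_log_probs: Callable[[List[int]], List[float]],   # hypothetical: log p_NMT(y_t | y_<t, x) per word
    beam_size: int,
) -> List[Tuple[List[int], float]]:
    """Expand every hypothesis by one word and keep the top-K new hypotheses."""
    candidates = []
    for prefix, score in hypotheses:
        log_probs = next_word_log_probs(prefix)
        for word_id, lp in enumerate(log_probs):
            candidates.append((prefix + [word_id], score + lp))
    # Keep the K highest-scoring hypotheses for the next time step.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```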
3 Domain Differential Adaptation

In this section, we propose two approaches under the overall umbrella of the DDA framework: Shallow Adaptation (DDA-Shallow) and Deep Adaptation (DDA-Deep). At a high level, both methods capture the domain difference with two LMs, trained on in-domain (LM-in) and out-of-domain (LM-out) monolingual data respectively. Without access to in-domain parallel data, we want to adapt the NMT model trained on out-of-domain parallel data (NMT-out) to approximate the NMT model that would be trained on in-domain parallel data (NMT-in). In the following sections, we assume that LM-in, LM-out, and the NMT-out model have been pretrained separately before being integrated.

3.1 Shallow Adaptation

…

3.2 Deep Adaptation

… words, which limits the model's flexibility. Our second, more expressive deep adaptation (DDA-Deep) method enables the model to learn how to make predictions based on the hidden states of LM-in, LM-out, and NMT-out. We freeze the parameters of the LMs and fine-tune only the fusion strategy and the NMT parameters.

Formally, at each time step t, we have three hidden states s^{(t)}_{LM-out}, s^{(t)}_{LM-in}, and s^{(t)}_{NMT-out}. We concatenate them and use a gating strategy to combine the three hidden states:

s^{(t)}_{concat} = [ s^{(t)}_{LM-out} ; s^{(t)}_{LM-in} ; s^{(t)}_{NMT-out} ],
g^{(t)}_{LM-out}, g^{(t)}_{LM-in}, g^{(t)}_{NMT-out} = F( s^{(t)}_{concat} ),
s^{(t)}_{DA} = g^{(t)}_{LM-out} s^{(t)}_{LM-out} + g^{(t)}_{LM-in} s^{(t)}_{LM-in} + g^{(t)}_{NMT-out} s^{(t)}_{NMT-out}.
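A minimal PyTorch-style sketch of this gating strategy is given below. The single linear layer implementing F, the sigmoid non-linearity, and the element-wise products are illustrative assumptions, since the equations above do not fix them.

```python
import torch
import torch.nn as nn

class DeepAdaptationFusion(nn.Module):
    """Gated combination of the LM-out, LM-in and NMT-out hidden states (a sketch of
    the equations above; gate parameterization is an assumption, not taken from the paper)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # F: maps the concatenated states to one gate vector per source of information.
        self.gate_proj = nn.Linear(3 * hidden_size, 3 * hidden_size)

    def forward(self, s_lm_out: torch.Tensor, s_lm_in: torch.Tensor,
                s_nmt_out: torch.Tensor) -> torch.Tensor:
        # s_concat^(t) = [s_LM-out^(t); s_LM-in^(t); s_NMT-out^(t)]
        s_concat = torch.cat([s_lm_out, s_lm_in, s_nmt_out], dim=-1)
        # g_LM-out^(t), g_LM-in^(t), g_NMT-out^(t) = F(s_concat^(t))
        gates = torch.sigmoid(self.gate_proj(s_concat))
        g_lm_out, g_lm_in, g_nmt_out = gates.chunk(3, dim=-1)
        # s_DA^(t): gated sum of the three hidden states.
        return g_lm_out * s_lm_out + g_lm_in * s_lm_in + g_nmt_out * s_nmt_out
```

During fine-tuning, only this fusion module and the NMT parameters would receive gradient updates, with both LMs kept frozen as described above.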
Table 1: Translation accuracy (BLEU; Papineni et al. (2002)) under different settings. The first three rows list the
language pair, the source domain, and the target domain. “LAW”, “MED” and “IT” represent law, medical and IT
domains, respectively. We use compare-mt (Neubig et al., 2019) to perform significance tests (Koehn, 2004) and
statistical significance compared with the best baseline is indicated with ∗ (p < 0.005) and † (p < 0.05).
… et al., 2014), which contain data from several domains, and test on the multilingual TED test sets of Duh (2018). We consider two language pairs for this setting, namely Czech and German to English. The Czech-English and German-English datasets consist of about 1M and 4.5M sentences respectively, and the development and test sets contain about 2K sentences. Byte-pair encoding (Sennrich et al., 2016b) is employed to process the training data into subwords with a vocabulary size of 50K for both settings.

… fusion learns to combine the hidden states of LM-in and NMT-out. We denote shallow fusion and deep fusion as "LM-Shallow" and "LM-Deep". 2) The copied monolingual data model (Currey et al., 2017), which copies target in-domain monolingual data to the source side to form synthetic in-domain data. 3) Back-translation (Sennrich et al., 2016a), which enriches the training data by generating synthetic in-domain parallel data via a target-to-source NMT model trained on an out-of-domain corpus.
Table 2: Performance of DDA-Deep when fusing different parts of the models on the law and medical datasets.

We conduct experiments on the law and medical datasets in OPUS, and the experimental results are shown in Table 2. We find that fusing hidden states is generally better than fusing word embeddings, and that fusing hidden states together with word embeddings does not show any improvement over simply fusing hidden states alone. These results indicate that combining the higher-level information captured by the encoder states is more advantageous for domain adaptation. We also found that directly using DDA-Deep to fuse output probabilities was unstable even after trying several normalization techniques, possibly because of the sensitivity of the output probabilities.

Figure 2: Correlation between log p_NMT-LAW − log p_NMT-MED and log p_LM-LAW − log p_LM-MED. We decode each model on the medical set by feeding in the gold labels and calculate the mean of the total log probabilities. We plot 100 words that appear frequently in both domains.
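A possible reading of the scoring procedure behind Figure 2 is sketched below; `token_log_probs` is a hypothetical interface to a model scored with teacher forcing, and the per-word aggregation reflects one interpretation of the caption rather than the exact script used for the figure.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def mean_log_prob_per_word(
    sentences: List[List[str]],
    token_log_probs: Callable[[List[str]], List[float]],  # hypothetical: teacher-forced log-prob per token
) -> Dict[str, float]:
    """Average log-probability a model assigns to each word type over a corpus of gold sentences."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for sent in sentences:
        for word, lp in zip(sent, token_log_probs(sent)):
            totals[word] += lp
            counts[word] += 1
    return {w: totals[w] / counts[w] for w in totals}

# The x-axis value for a word w would then be the difference between the two LM scores,
# e.g. mean_log_prob_per_word(med_set, lm_law)[w] - mean_log_prob_per_word(med_set, lm_med)[w].
```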
We conduct our analysis at the level of the subwords used in the MT system, and study whether our methods can generate in-domain subwords that never appeared, or appeared only infrequently, in the out-of-domain dataset, as well as whether our methods can generate these in-domain subwords accurately.

First, we focus on domain-specific subwords, i.e. subwords that appear exclusively in the in-domain data. The counts of these subwords are shown in Figure 3. In general, both the baseline and DDA-Shallow struggle to generate subwords that never appear in the out-of-domain parallel data. On the other hand, copying monolingual data performs better in this respect, because it exposes the model to subwords that appear only in the in-domain data. DDA-Deep generates the largest number of in-domain subwords among the four models, indicating the effectiveness of our method.

Figure 3: Number of generated domain-specific subwords, and adaptation extent and adaptation accuracy scores for each method. Top: count of generated words that exist only in the in-domain data, for each model; Middle: adaptation extent of each model; Bottom: adaptation accuracy of each model. Each group of bars corresponds to one adaptation setting: LAW-MED, LAW-IT, MED-LAW, MED-IT, IT-LAW, and IT-MED.

Second, we propose two subword-level evaluation metrics that study whether the models can generate in-domain subwords and whether the generated in-domain subwords are correct. We first define the metric "Adaptation Extent" (AE) as follows:

AE = \frac{1}{|V|} \sum_{w \in V} \frac{freq_in(w)}{freq_out(w)} count(w),    (3)

where V is the whole vocabulary, freq_in(w) and freq_out(w) represent the frequency of subword w in the in-domain and out-of-domain corpora respectively, and count(w) measures how many times subword w appears in the translation result.

We define "Adaptation Accuracy" (AA) in a similar way:

AA = \frac{1}{|V|} \sum_{w \in V} \frac{freq_in(w)}{freq_out(w)} F1(w),    (4)

where F1(w) denotes the F1-score of subword w. To avoid dividing by zero, we use add-one smoothing when calculating freq_out(w). While AE measures the quantity of in-domain subwords the models can generate, AA tells us about the quality of these subwords, namely whether the in-domain subwords form meaningful translations.

We plot the AE and AA scores of our methods as well as the baselines in Figure 3. The AE scores demonstrate that both DDA-Shallow and DDA-Deep adapt the model to a larger extent than the other baselines, even though DDA-Shallow fails to generate domain-specific subwords. In addition, the AA scores reveal that DDA-Deep outperforms the other methods in terms of adaptation accuracy, while DDA-Shallow is relatively weak in this respect. However, it should be noted that there is still a large gap between the deep adaptation method and the upper bound where the gold reference is used as the "translation"; the upper bound is about 10 for each setting.
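A direct implementation of Eqs. (3) and (4) can be sketched as follows; the frequency tables, per-subword F1 scores, and vocabulary are assumed to be precomputed, and add-one smoothing is applied to freq_out(w) as described above.

```python
from typing import Dict, Set

def adaptation_extent(freq_in: Dict[str, int], freq_out: Dict[str, int],
                      count: Dict[str, int], vocab: Set[str]) -> float:
    """Adaptation Extent (Eq. 3): frequency-ratio-weighted count of generated subwords."""
    total = 0.0
    for w in vocab:
        ratio = freq_in.get(w, 0) / (freq_out.get(w, 0) + 1)  # add-one smoothing on freq_out
        total += ratio * count.get(w, 0)
    return total / len(vocab)

def adaptation_accuracy(freq_in: Dict[str, int], freq_out: Dict[str, int],
                        f1: Dict[str, float], vocab: Set[str]) -> float:
    """Adaptation Accuracy (Eq. 4): frequency-ratio-weighted F1 of generated subwords."""
    total = 0.0
    for w in vocab:
        ratio = freq_in.get(w, 0) / (freq_out.get(w, 0) + 1)
        total += ratio * f1.get(w, 0.0)
    return total / len(vocab)
```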
We also sample some translation results and show them in Table 3 to qualitatively demonstrate the differences between the methods. We can see that, by modifying the output probabilities, the DDA-Shallow strategy is able to adjust tokens translated by the baseline model to some extent, but it is not capable of generating the domain-specific subwords "Ab- ili- fy". The DDA-Deep strategy, however, can encourage the model to generate domain-specific tokens and make the translation more correct.

Reference:    why has Ab- ili- fy been approved ?
Baseline:     reasons was received why a reminder was accepted ?
DDA-Shallow:  why has been approved ?
Copy:         why ,
DDA-Deep:     why was Ab- ili- fy authorised ?

Table 3: Translation examples under the law-to-medical adaptation setting.

Strategy          LAW-MED   MED-LAW
LM-in + LM-out    18.02     6.51

Table 4: Performance of ensembling different LMs on the law and medical datasets. LMs-general are trained on both the in-domain and out-of-domain datasets.

Figure 4: Perplexity on the development set for each method under the continued training setting. "no LMs", "LM-in", "LM-out" and "LM-in + LM-out" denote the baseline model, LM-Deep with LM-in, LM-Deep with LM-out, and DDA-Deep, respectively. The x-axis shows the number of training iterations.
All of the above quantitative and qualitative results indicate that our strategies indeed help the model adapt from the source to the target domain.

5.4 Necessity of Integrating both LMs

In this section, we further examine the necessity of integrating both in-domain and out-of-domain LMs. Although the previous experimental results partially support this claim, we perform a more detailed analysis to ensure that the gain in BLEU points comes from the joint contribution of LM-in and LM-out.

Ensembling LMs. Ensembling multiple models is a common and broadly effective technique in machine learning, and one possible explanation for our gains is that we are simply adding more models into the mix. To this end, we compare DDA-Deep with three models: the first integrates NMT-out with two LMs-in trained with different random seeds, and the second integrates NMT-out with two LMs-out; we also integrate two general-domain LMs trained on both the in-domain and out-of-domain data and compare the performance. The experimental results are shown in Table 4.

We can see that DDA-Deep achieves the best performance compared with the three other models, demonstrating that the gain in BLEU is not simply due to using more models.

Continued Training. In this section, we attempt to gain more insight into the contribution of LM-in and LM-out by investigating how DDA-Deep behaves under a continued training setting, where a small number of in-domain parallel sentences are available. We first train the NMT-out model until convergence on the out-of-domain corpus, and then fine-tune it with DDA-Deep on the in-domain corpus. Here we use the medical and IT datasets as our out-of-domain and in-domain corpora respectively, mainly because the baseline model performs poorly under this setting. We randomly select 10,000 parallel sentences from the in-domain dataset for continued training.

We freeze LM-in and LM-out as before and fine-tune the NMT-out model. The results are shown in Figure 4. We find that the perplexity of the deep adaptation method on the development set drops more dramatically than that of the baseline models. Figure 4 also shows that integrating only LM-in or only LM-out with the NMT model does not help, and sometimes even hurts performance. This finding indicates that there indeed exists some correlation between LMs trained on different domains. Using both LM-in and LM-out together is essential for the NMT model to exploit the domain difference and adapt more effectively.

However, if we look at the BLEU points on the development set, DDA-Deep with continued training performs much worse than the baseline model (13.36 vs. 15.61), as shown in Table 5 (β = 0). This sheds light on some limitations of our proposed method, which we will discuss in the next section.
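The freezing scheme used here can be sketched as follows; `lm_in`, `lm_out`, `nmt_out`, and `fusion` are placeholder modules standing in for the pretrained models and the DDA-Deep fusion layer, and the Adam optimizer with learning rate 1e-4 is an illustrative choice rather than the configuration used in the experiments.

```python
import torch

def continued_training_optimizer(lm_in, lm_out, nmt_out, fusion, lr: float = 1e-4):
    """Freeze LM-in and LM-out; optimize only the NMT-out and fusion parameters."""
    for lm in (lm_in, lm_out):
        for p in lm.parameters():
            p.requires_grad_(False)  # the two LMs stay fixed during continued training
    trainable = [p for module in (nmt_out, fusion) for p in module.parameters()]
    return torch.optim.Adam(trainable, lr=lr)
```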
Coverage penalty β 0.00 0.05 0.10 0.15 0.20 0.25 0.30
Baseline (no LMs) 15.61 16.28 17.26 17.59 17.21 16.39 15.96
LM-Deep (LM-out) 13.56 14.61 15.52 15.92 15.98 15.76 15.24
LM-Deep (LM-in) 12.00 13.36 14.56 15.10 15.62 15.98 15.57
DDA-Deep (LM-in + LM-out) 13.36 15.18 17.52 18.46 18.62 18.03 17.17
Table 5: BLEU points of models after continued training on the IT development dataset with different values of
coverage penalty β.
References

M. Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. 2017. Multi-domain neural machine translation through unsupervised adaptation. In Conference on Machine Translation (WMT).

Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06897.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In International Conference on Computer Vision (ICCV).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Junjie Hu, Mengzhou Xia, Graham Neubig, and Jaime Carbonell. 2019. Domain adaptation of neural machine translation by lexicon induction. In Annual Meeting of the Association for Computational Linguistics (ACL).

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Workshop on Neural Machine Translation (WMT).

Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. 2015. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML).

Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, and Xinyi Wang. 2019. compare-mt: A tool for holistic comparison of language generation systems. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) Demo Track.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL).

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohamadi, and Sanjeev Khudanpur. 2018. Semi-orthogonal low-rank matrix factorization for deep neural networks. In Annual Conference of the International Speech Communication Association (INTERSPEECH).

Swami Sankaranarayanan and Yogesh Balaji. 2017. Generate to adapt: Aligning domains using generative adversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Annual Meeting of the Association for Computational Linguistics (ACL).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Annual Meeting of the Association for Computational Linguistics (ACL).

Aditya Siddhant, Anuj Goyal, and Angeliki Metallinou. 2019. Unsupervised transfer learning for spoken language understanding in intelligent agents. In AAAI Conference on Artificial Intelligence (AAAI).

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NeurIPS).

Ben Tan, Yu Zhang, Sinno Jialin Pan, and Qiang Yang. 2017. Distant domain transfer learning. In AAAI Conference on Artificial Intelligence (AAAI).
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In International Conference on Language Resources and Evaluation (LREC).

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).

Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2017. Instance weighting for neural machine translation domain adaptation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. 2017. Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems (NeurIPS).

Wei Ying, Yu Zhang, Junzhou Huang, and Qiang Yang. 2018. Transfer learning via learning to transfer. In International Conference on Machine Learning (ICML).