

BERT-fused Model for Finnish-Swedish Translation
Iiro Kumpulainen∗†, Jouko Vankka∗
∗ Department of Military Technology
National Defence University
Helsinki, Finland
jouko.vankka@mil.fi
† Department of Computer Science

University of Helsinki
Helsinki, Finland
iiro.kumpulainen@helsinki.fi

Abstract—Translation between Finnish and Swedish is a common yet time-consuming and expensive task. In this paper, we train new neural machine translation models and compare them with publicly available tools for automatic translation of Finnish to Swedish. Furthermore, we analyze if fusing BERT models with traditional Transformer models produces better translations. We train a base Transformer and a large Transformer model using Fairseq and compare the results with BERT-fused versions of the models. Our large Transformer model matches the state-of-the-art performance in Finnish-Swedish translation and slightly improves the BLEU score from 29.4 to 29.8. In our experiments, fusing the smaller Transformer model with a pre-trained BERT improves the quality of the translations. Surprisingly, the larger Transformer model in contrast does not benefit from being fused with a BERT model.

Index Terms—machine translation, natural language processing, NMT, NLP, BERT, Finnish, Swedish

I. INTRODUCTION

Finnish and Swedish are the two official languages of Finland, and translating between the two languages is necessary in many different contexts. These translation efforts are expensive and time-consuming, and better tools for automating the translation process are required. In this paper, we train new neural machine translation (NMT) models for Finnish-Swedish translation, and compare them with existing automatic translation systems. In addition, we analyze if fusing BERT models with traditional Transformer models produces better translations in this context.

The BERT model is a state-of-the-art method for a wide range of natural language processing (NLP) tasks such as language inference and question answering [1]. On the other hand, current state-of-the-art neural machine translation models rely on the Transformer architecture introduced by Vaswani et al. [2], which uses self-attention mechanisms to compute representations for the input and output sequences.

Incorporating a language understanding model such as BERT into neural machine translation seems like a promising idea, but achieving better results in practice has proven to be difficult. We follow the approach by Zhu et al. [3] by training our NMT model for Finnish-Swedish translation, and fusing it with a pre-trained Finnish language BERT model by Virtanen et al. [4]. We compare the results with previous research work, as well as with the translation engines Google Translate, Microsoft Translator, and EU Council Presidency Translator (https://presidencymt.eu/), which are available online. Our models and translation results are made publicly available online at https://github.com/MusserO/BERT-fused_fi-sv.

II. RELATED WORK

NMT models have previously been trained for Finnish-Swedish by Tiedemann et al. [5] as a part of a project called FISKMÖ. The FISKMÖ project aims to provide a large corpus of parallel data in Finnish and Swedish, and to develop openly accessible translation services between the two languages. We follow their work by using the same datasets and replicating their NMT model to use it as a baseline. We then expand on their work by training two new NMT models and experimenting with fusing them with a pre-trained Finnish language BERT model.

Another project with NMT for Finnish-Swedish is OPUS-MT by Tiedemann and Thottingal [6]. In OPUS-MT, the focus was not on Finnish-Swedish, but the authors trained NMT models for over a thousand language pairs. For their Finnish-Swedish translation model, they used the same parallel dataset and training framework as in the FISKMÖ project.

While fine-tuning BERT models has been successful in a wide variety of NLP tasks, there is little work on successfully incorporating BERT into machine translation. Zhu et al. [3] managed to achieve state-of-the-art results for a number of English language translation tasks by introducing BERT-fused models, where a pre-trained NMT model and a pre-trained BERT model are fused together. In this work, we apply this method to Finnish-Swedish translation by training new NMT models and fusing them with a Finnish language BERT model trained by Virtanen et al. [4]. While the experiments by Zhu et al. [3] suggest that BERT-fused models outperform the original Transformer models, we show that this is not always the case for large Transformer models.
III. METHODS

Following the example by Tiedemann et al. [5], in order to train our models, we used data from OPUS, the open parallel corpus [7]. We used all the available OPUS datasets for training, except for the Tatoeba dataset, which we used for validation and testing. Our training dataset consists of 30,877,167 sentences that are available in both Finnish and Swedish. From the Tatoeba dataset, we used 5,000 randomly selected sentences for validation and the remaining 4,778 sentences for testing.
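As an illustration, the validation/test split described above could be produced with a few lines of Python; the file names and the fixed random seed below are our own assumptions for the sketch, not details given in the paper.

```python
import random

# Hypothetical file names for the aligned Tatoeba sentences (one sentence per line).
with open("tatoeba.fi-sv.fi", encoding="utf-8") as f_fi, \
     open("tatoeba.fi-sv.sv", encoding="utf-8") as f_sv:
    pairs = list(zip(f_fi.read().splitlines(), f_sv.read().splitlines()))

random.seed(42)        # fixed seed for a reproducible split (our assumption)
random.shuffle(pairs)

valid = pairs[:5000]   # 5,000 randomly selected pairs for validation
test = pairs[5000:]    # the remaining 4,778 pairs for testing
```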
In order to keep our training setup comparable with previous work, we tried to match it with the training environment in the FISKMÖ project [5]. This way, possible improvements in performance would be caused by changes in the methodology rather than by a large increase in computing power. We used four Tesla V100 GPUs for training, and we first trained a baseline 6-layer Transformer model with 8 attention heads using the Marian NMT framework [8] with similar hyperparameters as Tiedemann et al. [5]. All the details about the model architecture and hyperparameters that are necessary for reproducing our results can be found in our publicly available online repository (https://github.com/MusserO/BERT-fused_fi-sv).
Next, in order to get an NMT model that is compatible with the Finnish language BERT model by Virtanen et al. [4], we trained a 6-layer Transformer with 8 attention heads using Fairseq, a sequence modeling toolkit for Python [9]. We tried to set the hyperparameters to match the model trained with Marian, but several distinctions remained due to the differences between the Fairseq and Marian frameworks. After pre-training the model until convergence, we fused the Fairseq model with the BERT model using a recently updated version of the code provided by Zhu et al. [3].

The fused model is then further trained on the same training data until the new model converges. Finally, we repeat the experiment with a larger 6-layer Transformer with 16 attention heads using Fairseq. For this model, we use hyperparameters following the Fairseq recommendations for scaling models [10].
The architecture of a fused Transformer NMT model and BERT model is shown in Figure 1. In the fused model, inputs are first fed to the BERT model, which produces an encoded representation of the input. This representation given by the last layer of the BERT model is then used in the attention components in each encoder and decoder layer in the fused NMT model. Regular Transformer models use self-attention components, as well as encoder-decoder attention components in the decoding layers. The BERT-fused model augments these attention components by adding BERT attention to the self-attention in the encoding layers and to the encoder-decoder attention in the decoding layers.

Fig. 1. The architecture of a 6-layer Transformer model that has been fused with a BERT model.
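To make the fusion concrete, the following is a minimal PyTorch-style sketch of a single BERT-fused encoder layer in the spirit of Zhu et al. [3]. The dimensions, the plain averaging of the two attention outputs, and the omission of dropout and of the drop-net mechanism used during training are simplifications of ours, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    """Simplified sketch of one encoder layer of a BERT-fused Transformer."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, bert_dim=768):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # BERT attention: queries come from the NMT encoder states,
        # keys and values from the last-layer representation produced by BERT.
        self.bert_attn = nn.MultiheadAttention(d_model, n_heads,
                                               kdim=bert_dim, vdim=bert_dim,
                                               batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, bert_out):
        # x: (batch, src_len, d_model) encoder states
        # bert_out: (batch, bert_len, bert_dim) output of the last BERT layer
        h_self, _ = self.self_attn(x, x, x)
        h_bert, _ = self.bert_attn(x, bert_out, bert_out)
        # Combine ordinary self-attention with BERT attention before the residual.
        x = self.norm1(x + 0.5 * (h_self + h_bert))
        return self.norm2(x + self.ffn(x))
```

The decoder layers are augmented analogously, with BERT attention added alongside the encoder-decoder attention.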
We omit detailed background explanations of the attention mechanism and the Transformer model architecture to increase readability. For more details about the Transformer model architecture, we refer readers to the publication by Vaswani et al. [2] and to guides such as "The Annotated Transformer" (http://nlp.seas.harvard.edu/2018/04/03/attention.html). For a more detailed description of BERT attention and the method of fusing BERT with an NMT model, we refer to the publication by Zhu et al. [3].
IV. RESULTS AND DISCUSSION

For comparing our models with previous methods and existing translation engines online, we used the same testing dataset as FISKMÖ, consisting of 523 parallel sentences translated by hand [5]. While we reserved a part of the Tatoeba dataset for testing, those sentences were not used for these test results and comparisons with previous models, because at least FISKMÖ and OPUS-MT used the Tatoeba data for training. To automatically evaluate the translation quality, we use SacreBLEU [11] and chrF [12] scores. (Our chrF scores seem to slightly differ from those used by Tiedemann et al. [5]; we used the Python NLTK module version 3.5 implementation of the chrF score with default parameters.)
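As a hedged illustration of this evaluation step, the snippet below scores a list of system outputs against references with the SacreBLEU Python API; the placeholder data and variable names are ours, and note that the chrF numbers reported in the paper come from NLTK's implementation rather than SacreBLEU, so scales and defaults may differ slightly.

```python
import sacrebleu

# hypotheses: system translations; references: the hand-translated test sentences
hypotheses = ["Det här är ett exempel."]        # illustrative placeholder data
references = ["Det här är bara ett exempel."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # corpus-level BLEU
chrf = sacrebleu.corpus_chrf(hypotheses, [references])  # SacreBLEU's chrF variant

print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.3f}")
```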
We compare our translation results with the online translation engines Google Translate, Microsoft Translator, and EU Council Presidency Translator, as well as with the previous works on Finnish-Swedish translation, FISKMÖ and OPUS-MT. The Swedish output translations for each of the models are available in our repository for a qualitative analysis of the translations. The BLEU and chrF scores are shown in Table I.

Translation Engine           BLEU   chrF
Google Translate             20.8   0.548
Microsoft Translator         20.5   0.547
EU Presidency Translator     29.4   0.626
FISKMÖ                       26.6   0.593
OPUS-MT                      27.8   0.603
Marian Base (Ours)           27.5   0.602
Fairseq Base (Ours)          20.3   0.535
BERT-fused Base (Ours)       23.4   0.560
Fairseq Large (Ours)         29.8   0.611
BERT-fused Large (Ours)      28.7   0.607

TABLE I. Comparison of available translation systems and models trained by us for Finnish-Swedish translation. Best scores are marked in bold.
Google Translate and Microsoft Translator achieve very similar scores, whereas EU Council Presidency Translator provides significantly better translations. As expected, our base Marian Transformer model achieves results similar to FISKMÖ and OPUS-MT, which also use the OPUS dataset and the Marian framework [5, 6]. These methods outperform Google Translate and Microsoft Translator but fall slightly behind EU Council Presidency Translator.

Surprisingly, our base Fairseq Transformer model falls significantly behind the Marian model, despite using the same architecture and our attempts to use similar hyperparameters. We therefore believe it would be better to use different hyperparameters for Fairseq models due to the differences in the frameworks. When training the large Fairseq model, the hyperparameters were more suitable for Fairseq and resulted in faster training and state-of-the-art results matching the EU Council Presidency Translator. Since the hyperparameters for training the smaller Fairseq model were clearly not optimal, the training times of our models are not compared here, but the approximate training durations can be found in our publicly available repository.

When fusing the base Fairseq model with the Finnish BERT model, the results clearly improved. However, when fusing the BERT model with our large Fairseq model, the scores did not improve but slightly decreased. These results suggest that fusing BERT with NMT may be useful in low-resource scenarios when larger models cannot be trained, but in high-resource scenarios BERT-fused models are not necessarily better than the original Transformer models. However, more experiments are needed to determine how and when to use BERT-fused NMT models most effectively.

V. CONCLUSION AND FUTURE WORK

In this work, we trained neural machine translation models for translating Finnish into Swedish and compared the results with previous research and publicly available translation systems. Our large Transformer model matched the state-of-the-art performance in Finnish-Swedish translation and slightly improved the BLEU score from 29.4 to 29.8. The model is made publicly available online.

In addition, we investigated the benefits of fusing a BERT model with Transformer models in the context of Finnish-Swedish translation. In our experiments, fusing a pre-trained BERT model was beneficial for a smaller Transformer model, whereas our large Transformer model did not improve from being fused with a BERT model. These results agree with previous work in that BERT-fused models lead to better translations for smaller models. In contrast with the previous work, our results also show that large Transformer models do not always benefit from being fused with a BERT model. This suggests that fusing machine translation models with BERT models may be useful primarily for language pairs that do not have enough parallel training data for training large machine translation models. However, more experiments are needed before this can be confidently concluded.

Future work may expand our work to different datasets and language pairs to gain further insight as to when it is beneficial to use BERT-fused models for NMT. In addition, future work may analyze if using a larger BERT model that matches the size of the large Transformer model could improve the results. Besides using a BERT model, monolingual data from the target language may be included in the training by using back-translation, which has been shown to be an effective data augmentation technique for machine translation tasks [13]. While the BERT model in our experiments is trained using Finnish sentences, back-translation would use Swedish sentences by translating them to Finnish with a Swedish-Finnish translation model that has been trained on a parallel corpus. These artificially translated sentences could then be added to the training data of a new model for translating Finnish to Swedish, by training the new model to recover the original Swedish sentences from the artificial translations.
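As a sketch of how such back-translation data could be generated, the snippet below runs an off-the-shelf OPUS-MT Swedish-Finnish model through the Hugging Face transformers library; the toolkit and the model name are our choice for illustration only and are not something the paper used, since back-translation is proposed here purely as future work.

```python
from transformers import MarianMTModel, MarianTokenizer

# Off-the-shelf OPUS-MT Swedish->Finnish model (assumed available on the Hugging Face hub).
model_name = "Helsinki-NLP/opus-mt-sv-fi"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def back_translate(swedish_sentences):
    """Translate monolingual Swedish sentences into synthetic Finnish sources."""
    batch = tokenizer(swedish_sentences, return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Each (synthetic Finnish, original Swedish) pair is added to the fi->sv training data,
# so the new model learns to recover the original Swedish from the artificial Finnish.
swedish = ["Det här är en exempelmening."]
synthetic_finnish = back_translate(swedish)
new_training_pairs = list(zip(synthetic_finnish, swedish))
```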
Furthermore, the translation results may also be improved by using ensembles of multiple models, which often slightly outperform the individual models.
REFERENCES

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[3] J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, and T. Liu, “Incorporating bert into neural machine translation,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=Hyl7ygStwB
[4] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo, “Multilingual is not enough: Bert for finnish,” arXiv preprint arXiv:1912.07076, 2019.
[5] J. Tiedemann, T. Nieminen, M. Aulamo, J. Kanerva, A. Leino, F. Ginter, and N. Papula, “The fiskmö project: Resources and tools for finnish-swedish machine translation and cross-linguistic research,” in Proceedings of The 12th Language Resources and Evaluation Conference, 2020, pp. 3808–3815.
[6] J. Tiedemann and S. Thottingal, “Opus-mt – building open translation services for the world,” in Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, 2020, pp. 479–480.
[7] J. Tiedemann, “Parallel data, tools and interfaces in opus,” in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), 2012, pp. 2214–2218.
[8] M. Junczys-Dowmunt, R. Grundkiewicz, T. Dwojak, H. Hoang, K. Heafield, T. Neckermann, F. Seide, U. Germann, A. F. Aji, N. Bogoychev et al., “Marian: Fast neural machine translation in c++,” in Proceedings of ACL 2018, System Demonstrations, 2018, pp. 116–121.
[9] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, “fairseq: A fast, extensible toolkit for sequence modeling,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 48–53.
[10] M. Ott, S. Edunov, D. Grangier, and M. Auli, “Scaling neural machine translation,” in Proceedings of the Third Conference on Machine Translation (WMT), 2018.
[11] M. Post, “A call for clarity in reporting bleu scores,” in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, pp. 186–191.
[12] M. Popović, “chrf: character n-gram f-score for automatic mt evaluation,” in Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015, pp. 392–395.
[13] S. Edunov, M. Ott, M. Auli, and D. Grangier, “Understanding back-translation at scale,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 489–500.
