
Artificial Intelligence Review (2023) 56:10137–10226

https://doi.org/10.1007/s10462-023-10423-5

Machine translation and its evaluation: a study

Subrota Kumar Mondal1 · Haoxi Zhang1 · H. M. Dipu Kabir2 · Kan Ni1 · Hong-Ning Dai3

Published online: 19 February 2023


© The Author(s), under exclusive licence to Springer Nature B.V. 2023

Abstract
Machine translation (MT) has been one of the most popular fields in computational
linguistics and Artificial Intelligence (AI). As one of the most promising approaches, MT
can potentially break the language barrier between people from all over the world. Despite
the number of studies in MT, few studies summarize and compare MT methods. To this
end, in this paper, we principally focus on presenting the two mainstream MT schemes:
statistical machine translation (SMT) and neural machine translation (NMT), including
their basic rationales and developments. Meanwhile, the detailed translation models are
also presented, such as the word-based, syntax-based, and phrase-based models in statistical
machine translation. Similarly, approaches in NMT, such as the recurrent neural
network-based, attention mechanism-based, and transformer-based models, are presented.
Last but not least, evaluation approaches also play an important role in helping developers
improve their MT methods. The prevailing machine translation evaluation methodologies
are also presented in this article.

Keywords Natural Language Processing · Computational linguistics · Statistical machine translation · Neural machine translation · Evaluation methods

* Subrota Kumar Mondal
skmondal@must.edu.mo
Haoxi Zhang
zhanghaoxi97@gmail.com
H. M. Dipu Kabir
hussain.kabir@deakin.edu.au
Kan Ni
nikan1996nikan1996@gmail.com
Hong-Ning Dai
hndai@ieee.org
1 School of Computer Science and Engineering, Macau University of Science and Technology, Taipa 999078, Macao, China
2 Deakin University, Geelong, Australia
3 The Department of Computer Science, Hong Kong Baptist University, Hong Kong, China


Abbreviations
MT Machine translation
NLP Natural Language Processing
RBMT Rule-based Machine Translation
CBMT Corpus-based Machine Translation
SMT Statistical machine translation
EBMT Example-based Machine Translation
HMT Hybrid Machine Translation
NMT Neural machine translation
EM Expectation-Maximization
WBMT Word-based Machine Translation
SBMT Syntax-based Machine Translation
PBMT Phrase-based Machine Translation
CFG Context-Free Grammar
SCFG Synchronous Context-Free Grammar
ITG Inversion Transduction Grammar
TER Translation Edit Rate
HTER Human-targeted Translation Edit Rate
mTER Multi-reference TER
GNMT Google’s Neural Machine Translation
BP Backpropagation
BT Back-Translation
NN Neural Network
CNN Convolutional Neural Network
RNN Recurrent Neural Network
TNN Transformer Neural Network
GRU Gated Recurrent Unit
LSTM Long Short-Term Memory
BERT Bidirectional Encoder Representations from Transformers
LRL Low-Resource Language
HRL High-Resource Language
Enc Encoder
Dec Decoder
ALPAC Automatic Language Processing Advisory Committee
DARPA Defense Advanced Research Projects Agency
BLEU Bilingual Evaluation Understudy
NIST National Institute of Standards and Technology
METEOR Metric for Evaluation of Translation with Explicit ORdering
ROUGE Recall-Oriented Understudy for Gisting Evaluation
OpenMT Open Machine Translation Evaluation
T5 Text-To-Text Transfer Transformer
UNK Unknown


1 Introduction

As the medium through which people communicate, language reflects the human mode of
thinking. With rapid technological development, people across the world have become virtually
connected through the Internet. As people everywhere become more closely linked, cultural
exchanges and communications among different countries grow increasingly frequent. The
language barrier between cultures thus becomes the bottleneck when people of different
cultures aim to share their thoughts. Machine translation (MT) plays a key role in this
context. As a sub-field of computational linguistics, MT uses computers to convert texts
from one language to another. Machine translation has developed rapidly since the French
engineer G.B. Artsouni first introduced the idea of translating by machine with his device,
the "Mechanical Brain", in the 1930s (Zakir and Nagoor 2017). The process of machine
translation is very similar to that of a compiler, involving lexical analysis, syntax analysis,
semantic analysis, code optimization, target code generation, etc. However, many texts in
natural language are highly ambiguous: their meaning depends on many factors, such as
tone, context, and sequence. Therefore, the translation system needs optimal decision models
to select the best estimate.

1.1 Machine translation approaches

In this section, we present the commonly-used MT approaches.

1.1.1 Rule‑based machine translation (RBMT)

RBMT is the traditional MT approach, which relies on linguistic rules to translate words
in linguistically motivated ways. The linguistic rules originate from human-made references,
such as dictionaries or grammars of the languages (i.e., syntactic and semantic
analysis). RBMT first parses the input sentence to link its structure with the counterpart
structure of the output sentence. It then analyzes the structure based on linguistic rules.
Lastly, it generates the translation (Forcada et al. 2011). However, RBMT requires massive
human resources and time for linguistic experts to develop the language rules, which
dramatically affects the efficiency and cost of these approaches. Although considerable
rule-based libraries have been built, the performance of RBMT is barely adequate when
coping with massive real data. Instances of RBMT include transfer-based machine
translation (Furuse and Iida 1992), inter-lingual machine translation (Nyberg and Mitamura
1992), and dictionary-based machine translation (Hull and Grefenstette 1996). Prominent
RBMT systems include SYSTRAN1 (Toma 1977), Apertium2 (Corbí-Bellot et al. 2005),
GramTrans3 (Bick 2007), and others.

1.1.2 Corpus‑based machine translation (CBMT)

With the rise of corpus linguistics since the 1980s, people began to use computers
to obtain knowledge from bilingual text corpora automatically. The proliferation of

1 SYSTRAN https://www.systran.net/en/translate/.
2 Apertium https://www.apertium.org/index.eng.html.
3 GramTrans https://gramtrans.com/.


corpora and the corresponding corpus-based machine translation significantly changed the
awkward situation of MT and remarkably reduced the human cost. CBMT encompasses
statistical machine translation (SMT) and example-based machine translation (EBMT);
both approaches regard the bilingual corpus as the basis of translation. In SMT,
it is essential to preprocess the data to obtain statistics, such as parameters or probabilities.
By contrast, EBMT is based on the idea of analogy and uses bilingual texts
as its primary data source, in which preprocessing the data is optional (Dajun and Yun
2015; Vandeghinste et al. 2013). Notably, due to the ambiguity of natural languages,
examples or samples are sometimes not available to cover the translation. For this reason,
the EBMT method has mostly been used as a complement to other translation systems.
On the other hand, since SMT is built on top of machine learning and its parameters are
derived from the learning model, the translation model can be ported to multiple
platforms with new languages and/or new fields. We observe that SMT was quite popular
and prominently used until 2016. Examples include Google Translate (changed to
NMT in 2016),4 Microsoft Translator (changed to NMT in 2016),5 and Moses6 (Koehn et al. 2007).

1.1.3 Hybrid machine translation (HMT)

HMT is characterized by the use of multiple machine translation methods within a single
machine translation system (Costa-Jussa and Fonollosa 2015; Labaka et al. 2014). A single
MT approach may have pitfalls that can be complemented by other approaches. As
such, people combine different MT approaches to remedy the defects and enhance
translation performance. Several successful systems employ HMT methods, such
as PROMT7 and SYSTRAN.8 Common HMT approaches include multi-engine,
which runs multiple machine translation systems in parallel; statistical rule
generation, which uses statistical data to generate lexical and syntactic rules; and multi-
pass, which serially processes the input multiple times (Costa-Jussa and Fonollosa 2015;
Labaka et al. 2014).

1.1.4 Neural machine translation (NMT)

As the current state-of-the-art paradigm of MT, NMT predicts the likelihood of a sequence
of words by building and training a single large artificial neural network, which is capable
of deep learning (Bahdanau et al. 2014; Wu et al. 2016). The major benefit of NMT
is that the source and the target text can be trained directly in an encoder-decoder system,
so that it is fast and accurate. Hence, NMT contrasts with conventional MT approaches,
which consist of many small sub-components that are tuned separately (Bahdanau et al.
2014). It is the state-of-the-art MT approach that fully transcends the performance of SMT
and has rapidly raised interest in academia and industry. The classic neural translation
models involve the sequence-to-sequence model (Sutskever et al. 2014), encoder-decoder
model (Bahdanau et al. 2014), attention mechanism (Cui et al. 2016), transformer (Devlin

4 Google AI Blog https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html.
5 Microsoft Translator Blog https://www.microsoft.com/en-us/translator/blog/2016/11/15/microsoft-translator-launching-neural-network-based-translations-for-all-its-speech-languages/.
6 Moses https://www.statmt.org/moses/.
7 www.promt.com.
8 www.systransoft.com.


et al. 2018; Vaswani et al. 2017), etc. The notable applications of NMT in our daily life are
Google Translate (from 2016),9 Microsoft Translator (from 2016),10 Translation on Meta,11
and OpenNMT12 (Klein et al. 2017).

1.1.5 Augmentation of machine translation

We find that NMT lacks support for low-resource datasets and small corpora, even
though it dominates the translation industry. We observe that prominent methods
built on top of baseline NMT or SMT models can help to minimize this issue and
consequently optimize the quality of translations significantly. To this end, we elaborate
how prominent approaches such as back-translation (Ramnath et al. 2021) (also
called round-trip translation (Ahmadnia and Dorr 2019; Somers 2005)) and pivot (bridging)
languages (Liu et al. 2018; Wu and Wang 2007) can help optimize translation
quality given a low-resource bilingual dataset or a small bilingual corpus. We also find
that knowledge graphs (KGs) (Ahmadnia et al. 2020; Bollacker et al. 2008; Lu et al. 2018;
Moussallem et al. 2019; Zhao et al. 2020a) play a wide role in augmenting machine
translation. Beyond the aforementioned approaches, various recent mechanisms also help
to enhance the development of machine translation broadly. We demonstrate the approaches
and mechanisms respectively in Sect. 4 and present their reflection on MT augmentation
accordingly.
Notably, for a quick introduction to the MT methods and their comparative perspectives
(also prospects), we present a summary of the illustrated MT methods in Table 1.
We would note that it is important to evaluate whether MT models meet quality
expectations. Therefore, we present a summary of commonly-used evaluation methods of
MT models in the subsequent section.

1.2 Evaluation of machine translation

It is obvious that the output of each MT system must meet quality expectations. To
judge whether MT systems meet the expected quality, we need to establish reliable evaluation
methods, manual or automatic. Fundamentally, it is required to find a compact
correlation between any source and target languages, where ambiguity exists in most
natural languages. We observe that there are numerous methods for MT evaluation, as
demonstrated below:

1.2.1 Human evaluation

The two most commonly-used human evaluation methods are introduced herein.
Automatic Language Processing Advisory Committee (ALPAC) (Pierce 1966; Wiki
2019a): ALPAC set up the very first evaluation method for MT, whose measurements
mainly focus on intelligibility and fidelity, using human “raters” as judges

9 Machine Translation Research at Google https://research.google/pubs/?area=machine-translation.
10 Machine Translation Research at Microsoft https://www.microsoft.com/en-us/research/group/machine-translation-group/.
11 Machine Translation Research at Meta AI https://ai.facebook.com/research/NLP.
12 OpenNMT https://opennmt.net/.


(Pierce 1966; Wiki 2019a). Intelligibility measures whether the translation is understandable
to humans. By contrast, fidelity measures how much of the information in the source
sentence is retained in the translated sentence.
Defense Advanced Research Projects Agency (DARPA) (White 1995): DARPA provides
a general standard for assessing MT along three measures based on human judgments:
Adequacy, Fluency, and Informativeness (White 1995). Adequacy
determines how much of the information in the source text is conveyed to the target
translation, regardless of translation quality. In particular, during evaluation, the translation
is compared with expert reference translations fragment by fragment and scored on a
scale from 1 to 5 based on how well the information is conveyed. Fluency, on the other
hand, measures how good the target translation (e.g., English) is as a sentence, i.e.,
whether it is fluent and well-formed, regardless of the correctness of the information.
Notably, the evaluation score is derived in the same manner as for Adequacy. Last but
not least, Informativeness measures the ability of a translation system to produce
quality translations, i.e., how well the translations convey enough information for people
to judge the system's translation ability. In particular, in the evaluation process, a judge
first reads the translations of candidate sentences or documents, and is then asked to
answer multiple questions about the translations. Finally, the evaluation is scored as the
fraction of correct answers over the total questions.

1.2.2 Automatic evaluation

Human evaluation methods have low efficiency and are expensive. To this end,
researchers and developers turn to automatic evaluation, seeking better and more convenient
approaches to evaluating MT systems. We briefly present the popular and commonly-
used automatic evaluation methods as follows.
Bilingual Evaluation Understudy (BLEU) (Papineni et al. 2002): BLEU is the most
commonly used metric and was introduced to break the bottleneck of human evaluation
methods. It measures similarity, intuitively over short sequences of words (regarded
as word N-grams), between the MT output and professional human reference
translations (Papineni et al. 2002). Notably, the BLEU score is precision-based,
analysing the word N-grams that co-occur between the candidate translation and
any reference translation(s).
National Institute of Standards and Technology (NIST) (Doddington 2002): NIST is
based on BLEU and consequently also uses the N-gram concept. However, NIST introduces
the informativeness of each N-gram: the less frequently an N-gram occurs,
the more weight it is assigned (Doddington 2002).
Metric for Evaluation of Translation with Explicit ORdering (METEOR) (Banerjee and
Lavie 2005): METEOR is one of the popular metrics for MT evaluation, like BLEU.
Notably, METEOR was introduced to address some of the issues that exist in the NIST
and BLEU metrics (Banerjee and Lavie 2005; Lavie et al. 2004). For instance, METEOR
(an F-score-based metric) takes more factors into account than BLEU and NIST, such as
recall, explicit word-to-word matching, and others.
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin 2004; Lin and Hovy
2003; Lin and Och 2004): ROUGE metrics are used for the evaluation of MT and text

13
Machine translation and its evaluation: a study 10143

Table 1  Summary of Machine Translation Methods


MT Method Summary

Rule-based Machine Translation (RBMT) Methodology: The RBMT method relies on linguistic
rules to translate words in linguistically motivated ways; the
rules originate from human-made references, such as
dictionaries or grammars of the languages (syntactic and
semantic analysis). Instances of RBMT include
transfer-based machine translation (Furuse and Iida 1992),
inter-lingual machine translation (Nyberg and Mitamura
1992), and dictionary-based machine translation (Hull and
Grefenstette 1996).
Examples: SYSTRAN (Toma 1977), Apertium, GramTrans
(Bick 2007), and others.
Pros: Does not need bilingual text; performance is good; rules
are reusable with new languages.
Cons: Requires massive human resources with expertise
and time, which affects efficiency and cost, i.e., rules are
manually set and require good dictionaries.
Corpus-based Machine Translation (CBMT) CBMT involves statistical machine translation
(SMT) and example-based machine translation
(EBMT). Both approaches regard the bilingual
corpus as the basis of translation (Dajun and Yun 2015;
Vandeghinste et al. 2013).
SMT: SMT is built on top of machine learning; since its
parameters are derived from the learning model,
the translation model can be ported to multiple platforms
with new languages and/or new fields. The basic idea of
SMT is to obtain the target translation through bilingual
or monolingual corpus analysis, using certain statistical
models.
EBMT: EBMT is based on the idea of analogy. Since
most natural languages are ambiguous, samples are
sometimes not available to cover the translation. For
this reason, it is mostly used as a complement to other
translation systems.
Pros: Significantly optimizes human cost, efficiency, and
time.
Hybrid Machine Translation (HMT) HMT is the combination of RBMT and SMT. A single
MT approach may have pitfalls that can be complemented
by combining different MT approaches. In
general, HMT involves a multi-engine setup that runs multiple
translation systems in parallel and accordingly helps to
enhance translation performance. Several successful
systems employ HMT methods, such as PROMT and
SYSTRAN (Costa-Jussa and Fonollosa 2015; Labaka et al.
2014).
Pros: Optimizes translation performance and quality.
Cons: Needs enormous human editing and human translation.



Statistical Machine Translation (SMT) SMT is famous for its multi-platform availability and
steady output. As stated earlier, SMT is corpus-based: it
requires a source-language corpus and a parallel target-language
corpus translated precisely by humans
(Hardmeier 2012). It improves on RBMT but
encounters many of the same issues. Note that there are
several variations of SMT models, such as word-based
translation, syntax-based translation, phrase-based translation,
hierarchical phrase-based translation, and others.
Examples: Google Translate (changed to NMT in 2016),
Microsoft Translator (changed to NMT in 2016), Moses
(Koehn et al. 2007).
Neural Machine Translation (NMT) NMT is the current state-of-the-art paradigm of MT.
It addresses the limitations of the RBMT and SMT methods
using deep learning algorithms (Bahdanau et al. 2014;
Wu et al. 2016). The major benefit of NMT is that the
source and the target text can be trained directly in an
encoder-decoder system, so that it is fast and accurate. The
classic neural translation models involve the sequence-to-
sequence model (Sutskever et al. 2014), encoder-decoder
model (Bahdanau et al. 2014), attention mechanism (Cui
et al. 2016), transformer (Vaswani et al. 2017; Devlin et al.
2018), etc.
Examples: Google Translate (from 2016), Microsoft Translator
(from 2016), Translation on Meta, OpenNMT (Klein
et al. 2017).

summarization. The metrics perform a comparative analysis of an automatically generated
translation or summary against a reference (or a set of references) (Lin 2004; Lin and Hovy
2003; Lin and Och 2004).
We observe that the larger the value of BLEU, NIST, METEOR, or ROUGE, the better
the system performs. By contrast, for two other metrics, TER (Snover et al. 2006)
(Translation Edit Rate, the amount of post-editing (PE) required for MT tasks) and WER
(Klakow and Peters 2002) (Word Error Rate, also called a distance metric), the lower the
score, the better the system. Notably, there are other automatic evaluation metrics used in
the community, especially specific metrics for specific purposes. We observe that BLEU is
the most commonly used metric and maintains high reliability compared with human evaluation.
However, it only takes precision into account and assigns equal weight to all N-grams,
which may cause poor evaluation. NIST has similar issues to BLEU, although it adds
an informativeness feature, which gives more importance to less frequent N-grams,
thereby helping to optimize evaluation. Then comes METEOR, which is also quite
popular, like BLEU. Notably, it helps to address some of the issues of BLEU and NIST with
added features, such as stemming and synonymy matching. However, this approach is naive,
so further enhancement is required; also, it only takes unigrams into the analysis. Finally,
the ROUGE metric performs a comparative analysis of an automatically generated translation
against a single reference or multiple references and can achieve comparable performance.
However, it is better known for text summarization evaluation. On the other hand, it is only


flexible with string-to-string matching; synonymy and paraphrase matching are not
taken into the analysis as in METEOR.

1.3 Motivation

Nowadays, the importance of machine translation is non-negligible, both in its social
meaning and its economic value. Notably, machine translation is not only language
translation - it underpins a considerable number of other functionalities, for example,
text-to-speech, interpreter telephony, speech-to-speech translation, phonetic input methods,
and even automatic composition of music or poems. Moreover, even though the Internet has
broken the barrier of distance in communication, the language barrier still exists. This barrier
has become a stumbling block to the circulation of information. Notably, in our daily life,
various machine translation software, apps, and portals help us in various ways to mitigate,
minimize, or address the language barrier - for example, Google Translate, Microsoft
Translator, DeepL, iTranslate, and many others. These apps help us translate a source
language into a target language. However, they are limited to a set of languages, and many
target translations are not accurate. In addition, if we reverse-translate the target sentence
back to the source sentence, the mapping is not symmetric in many cases. Therefore,
developing a large-scale MT system requires fully comprehending the different levels of
language processing problems and the techniques to overcome them. In essence, we
explore the following specific areas in great detail:

1. We present the two mainstream MT approaches, SMT (Sect. 2) and NMT (Sect. 3),
   and their respective translation models. Besides, we present comparative analyses
   among the SMT approaches (Sect. 2.6) and the NMT approaches (Sect. 3.4) separately,
   and also between the SMT and NMT approaches, while identifying the most adopted
   ones among researchers and developers.
2. We demonstrate a set of prominent MT methods built on top of NMT or SMT baseline
   models that address the existing issues in the baseline models and ultimately optimize
   translation accuracy. Besides, we include knowledge graphs and recent studies that
   focus on the advances of Machine Translation (Sect. 4).
3. We also demonstrate a comparative analysis between SMT and NMT from different
   perspectives, as discussed in Sect. 5, while identifying the most popular ones used by
   the community, researchers, and developers. In short: why is NMT the most used?
4. We likewise present the commonly used corpora and datasets used for the evaluation
   of the different SMT and NMT approaches illustrated in this paper (Sect. 6).
5. We also demonstrate the issues of ethical pitfalls and gender bias in Machine Translation
   (Sect. 7).
6. We present the evaluation methods of MT, including the human evaluation methods and
   automatic evaluation methods (Sect. 8).

We finally conclude the paper by summarizing our analysis.


2 Statistical machine translation

Statistical machine translation, one of the mainstream machine translation schemes,
is famous for its multi-platform availability and steady output. The basic idea
of SMT is to obtain the target translation through bilingual or monolingual corpus analysis,
using certain statistical models (discussed in a later part of this paper). SMT is corpus-based:
it requires a source-language corpus and a parallel target-language corpus, translated
precisely by humans (Hardmeier 2012). The first concept of SMT was put forward by
Warren Weaver, an American mathematician, who proposed the statistical model in his
memorandum of 1949 (Shannon and Weaver 1949). However, the model stalled
for over 40 years, until in 1993 it was revived by IBM's Thomas J. Watson Research
Center. This word-to-word statistical model (a noisy channel model) is called IBM
Model 1 (Brown et al. 1993). In the following years, the model was upgraded through IBM
Model 5 (Brown et al. 1993; Koehn 2009).

2.1 The noisy channel model in machine translation

The purpose of a noisy channel model is to remove noise. In machine translation, for
example, the model needs to remove the semantic or syntactic errors of the translation; the
noisy channel can help handle this issue (in this domain, it also performs corrections). In
SMT, the noisy channel model is usually considered the basis, as shown in Fig. 1. Given a
source-language sentence f, the translation system aims to put this sentence f through a
noisy channel to obtain a parallel (equivalent) translation sentence e, with a probability
P(f | e) as the translation model. Similarly, given a target-language sentence e, the
translation system can find the source-language sentence f using the highest probability
obtained. This is known as the decoding process. The decoding problem is to find the
target sentence e that could have generated f with the highest probability. To find the
optimal e, Bayes' rule is used:

e* = argmax_e p(e | f) = argmax_e p(e) p(f | e),

where p(e) is a language model that models quality, such as the fluency of the sentence.
A language model represents the probability of a sentence showing up in the target
language. Intuitively, the language model determines whether a sentence is reasonable
in the target language, based on the grammatical rules of the target language. p(f | e)
models the degree of trustworthiness of the translation (Ahmed and Hanneman 2005). The
translation model, on the other hand, is how the probability distribution p(f | e) is factored
to give probabilities to unseen pairs. Notably, the SMT method has gone through three
different periods of translation models: word-based, phrase-based, and syntax-based.

2.2 Word‑based machine translation

Translating a word in word-based MT is like looking it up in a dictionary (lexical
translation), which means the word is the fundamental unit of word-based MT. Initially, the
MT system selects words from the parallel corpora, where multiple translation choices
often appear because of the word ambiguity that exists across languages. For example,
the simple English word "hello" might be translated into different forms in Chinese.
Therefore, the MT system must estimate the translation probability to obtain the maximum


likelihood estimation (Koehn 2009). Notably, the MT system aligns words between the source
and the target language using a generative model. For instance, let i denote the position
of the target-language (English) word and j the position of the source-language
(German) word:

a : i → j.

For these two languages, we have an example sentence whose word positions are
numbered 1 to 4 (Fig. 2). The alignment function of the example sentences is:

a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}

In Fig. 2, we can see that the target-language words are generated by the source-language
words. However, words may be dropped or inserted during translation. Hence, the
alignment may not be one-to-one (Fig. 3). This phenomenon is called word reordering.
IBM Models 1 to 5 tokenize the sentences. Besides, the models carry out recording,
training, and summarizing of the tokenized tags, and implement the decoding and reordering
process (Koehn 2009). Notably, the GIZA++ package is one of the word-based translation
systems widely used by developers. It includes the essential models of word-based MT
(IBM Models, HMM models, etc.) (Och and Ney 2003).
Although the word-based developments were astonishing and revolutionary for that period,
word-based MT still lacks robustness, so translation systems cannot obtain sufficiently
accurate outputs. For example, parts of speech, word lattices, and synonyms are hard to
deal with in a word-based translation system: the expressiveness of information varies
across parts of speech, word lattices are computationally complex, and it is difficult to
derive the exact synonym. Furthermore, each word of the source language can only be
translated in one way, so diversity of translation cannot be achieved. In the following years,
phrase-based translation, syntax-based translation, and hierarchical phrase-based translation
were introduced to avoid these shortcomings.

2.2.1 Translation models

The translation models represent the probabilistic distribution underlying the translation
process from the source language S to the target language T. They are sophisticated.
Interlude: the Expectation-Maximization Algorithm. The purpose of the Expectation-
Maximization Algorithm (the EM Algorithm) is to compute maximum likelihood
estimates (parameter estimation) from incomplete data (Dempster et al. 1977). The EM
algorithm is widely applicable in SMT, which contains many probabilistic models, such as
Markov models, Bayes' networks, and many others. Intuitively, the EM Algorithm
can be compared to the chicken-or-egg problem: "A hen is only an egg's way of making
another egg" (Samuel Butler). The EM algorithm involves two steps: an expectation step
and a maximization step. The expectation step estimates the unknown variables given the
current estimate of the parameters and is determined by the observations. In contrast, the
maximization step provides new estimates of the parameters, according to the expectations.
Note that a single estimate (or a few) is not enough to obtain the final output, and
iterations are needed until convergence (Moon 1996).
In the context of the chicken-or-egg problem, the EM algorithm works as demonstrated
before. Here, the existing data works like a chicken: with the help of the expectation
step, the existing data help us estimate unknown, missing, or latent data. On the other
hand, the maximization step helps us optimize the estimated data, which is akin to
nurturing the laid egg, i.e., nurturing the data. Similarly, the egg can be new data or a
new chicken that helps us estimate further unknown data, and this continues until
convergence. In the following, we can see the reflection in the IBM models.

Fig. 1  The Noisy Channel Model, Ahmed and Hanneman, 2005, reproduced from the information in Ahmed and Hanneman (2005)

Fig. 2  a: { 1 → 1, 2 → 2, 3 → 3, 4 → 4 }, Koehn, 2009, reproduced from the information in Koehn (2009)

Fig. 3  a: { 1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4 }, Koehn, 2009, reproduced from the information in Koehn (2009)
IBM Model 1: A distinctive feature of word-based MT is that it parses sentences into
words. The IBM Models are a series of five parameter-estimation models with the essential
algorithms. These models compute the likelihood of a translation pair (f, e), which is
equivalent to p(f ∣ e). Word alignment depicts the relationship between the words in the
source language and the target language. IBM Model 1 simply uses lexical translation and
aligns the parsed (tokenized) words of the source and target language (Fig. 2). Unlike the
later models, Model 1 does not consider the order of the words, which means that it can
only align words in a fixed order (Brown et al. 1993).
Model 1 uses the Expectation-Maximization Algorithm to obtain the translation
probabilities. Besides, EM is used to obtain the alignment probabilities and to estimate
the parameters of the generative model. Figures 4, 5, 6 and 7 depict the process of the EM
algorithm. When the data converges, the algorithm yields the probability of each alignment.
In this example, p(la ∣ the) yields 0.453, which is the highest probability of all
alignments (Koehn 2009).
Model 1 provides the initial estimates of the parameters; however, it is not an accurate
translation model. It is weak in reordering, removing, or inserting words. In most
situations, one source word is translated into one single word, but sometimes one word
must be translated into multiple words, or even dropped during translation. The following
models try to fix these problems.

Fig. 4  [EM Algorithm] First Iteration: Look for the most frequent word pairs, Koehn, 2009, reproduced from the information in Koehn (2009)

Fig. 5  [EM Algorithm] Second Iteration: Learn more about the possible alignment, Koehn, 2009, reproduced from the information in Koehn (2009)

Fig. 6  [EM Algorithm] Third Iteration: Alignments become apparent, Koehn, 2009, reproduced from the information in Koehn (2009)

Fig. 7  [EM Algorithm] Fourth Iteration: Convergence, Koehn, 2009, reproduced from the information in Koehn (2009)
IBM Model 2: In comparison to IBM Model 1, Model 2 adds an alignment model so that the
order of the words is adjusted (Fig. 8); notably, it uses the estimates generated in
Model 1. Intuitively, Model 2 'memorizes' the common word positions of the output
sentences, which enhances the performance of translation, yet it is still not efficient
enough and leads to unsatisfactory alignments.

Fig. 8  IBM Model 2 - One more step, Koehn, 2009, reproduced from the information in Koehn (2009)
IBM Model 3: IBM Model 3 adds a model of fertility. Fertility is the number of English
words generated by a foreign word (Koehn 2009), the notion that each input word produces
a specific number of output words after translation. As in Model 2, the probability of a
connection is affected by the connection position and the length of the sentence (string).
By adding fertility, Model 3 successfully captures the one-to-many relation between the
source language and the target language. The fertility is modeled by the distribution, as
shown by
n(𝜙 ∣ f),

where 𝜙 denotes the number of words generated by a foreign word f. For example, in Fig. 9,
n(1∣haus), n(2∣zum), and n(0∣ja) are each equal to 1 (Koehn 2009), where the German word
zum generates two English words and the German word ja is dropped. To compensate for a
dropped word, a NULL token is sometimes inserted, as an extra step of Model 3. The last
step, i.e., the distortion (reordering) step, is quite similar to alignment, except that
different translations can be generated by the same alignments. Principally, the distortion
step follows the translation direction, i.e., predicting an output word position according
to the input word position. However, Model 3 is still deficient, as it cannot deal with the
many-to-one relation.
IBM Model 4 and Model 5: IBM Model 4 has better reordering performance. For instance, a
translated phrase may appear at a spot in the source language sentence different from
where the corresponding target (e.g., English) phrase appears in the target sentence.
Model 4 alleviates this problem by adding a relative reordering model. However, Models 1-4
remain deficient: some words may lie on top of one another, or be placed before the first
position or beyond the last position. Also, Model 4 sometimes assigns positive
probabilities to many implausible translations. In other words, Models 1-4 waste a
significant amount of probability mass by assigning positive probability to many
impossible alignments and translations, which prevents us from deriving representative
samples or inferring valid probability distributions; a subset of the distributions is
simply not relevant or useful. IBM Model 5 addresses these deficiencies by tracking the
available positions of words in the source language. As we can see, the IBM Models can
successfully create many-to-one mappings in word alignment, whereas real word alignments
also contain many-to-many mappings (Koehn 2009).
Summary:

– IBM Model 1: It performs only lexical translation, so it lacks translation accuracy.
– IBM Model 2: It adds absolute reordering, but suffers from unsatisfactory alignments
  of translations.
– IBM Model 3: It adds a fertility model, but cannot handle the many-to-one relationship.
– IBM Model 4: It makes reordering relative and thus better, but it comes with the
  numerous issues discussed in the aforementioned section.
– IBM Model 5: It fixes the deficiencies of the other IBM models, but it cannot handle
  many-to-many mappings.

Fig. 9  IBM Model 3 - Adding the fertility, Koehn, 2009, reproduced from the information in Koehn (2009)

2.3 Syntax‑based machine translation

The embryo of syntax-based machine translation was first presented in 1990 by Shieber and
Schabes (Shieber and Schabes 1990, 1991). Notably, syntax-based MT did not receive much
attention and acknowledgement until the beginning of the 21st century, because in the
1990s phrase-based MT was rather more popular and syntax-based MT performed poorly
compared to it (Och et al. 2003). However, defects of other MT approaches that were hard
to resolve made people focus more on syntax-based MT. Compared to word-based MT (the IBM
Models), syntax-based MT models produce better word alignments, because they overcome the
defect of word-based MT that it can only handle structurally similar language pairs. In
syntax-based MT, syntactic order and its correspondence analysis are of great importance;
they are the core of this translation method, for example, reordering objects or subjects
to other parts of the sentence. The syntax-based models are often established directly
from Penn Treebank (PTB) (Marcus et al. 1993) style parse trees by composing treebank
grammar rules. The treebanks help the models learn to judge or reorder the sentence
according to certain grammatical rules. Figure 10 depicts an intuitive syntax tree. In
this tree we can see that the English sentence "the house of the man is small" is parsed
into tags of its syntactic structures by a statistical parser, where NP denotes noun
phrase, VP denotes verb phrase, and so on (Koehn 2009a).

2.3.1 Grammar formalisms of syntax‑based machine translation

In this section, we introduce two basic grammar formalisms that syntax-based MT relies on:
synchronous context-free grammars and inversion transduction grammars. These formalisms
are regarded as the basis on which researchers develop syntax-based approaches, because
of their excellent generalization abilities.
Synchronous Context-Free Grammars: The synchronous context-free grammars (SCFGs) (Chiang
and Knight 2006; Wong and Mooney 2007) borrow and broaden the idea of context-free
grammars (CFGs). SCFGs generate a pair of related strings instead of the single string
generated by a CFG. Hence, SCFGs are well suited to comparing syntactic characteristics
between two languages. A CFG is a grammar that describes a formal language and is
constituted by a set of production rules. A CFG is recursive if a tag on the right-hand
side is the same as a tag on the left-hand side. In comparison, the productions in SCFGs
have two right-hand sides, the source side and the target side, respectively. Notably,
SCFG derivations can also be viewed as parse trees, and the performance of SCFGs is
comparable with some state-of-the-art phrase-based MT approaches.
Inversion Transduction Grammars: Inversion transduction grammar (ITG) is one of the
earliest approaches in the history of syntax-based MT (Wu 1997). It is a formalism for
modeling bilingual string pairs, incorporating a context-free model. In fact, ITGs are
very similar to SCFGs, but they relax the constraints on the input strings. One of the
pitfalls of CFGs or SCFGs is that the input language pairs sometimes must share the exact
same grammatical structures; ITG therefore removes the constraint of parallel ordering.
The normal form of ITG involves three production rules:

X → e∕f (1)

X → [YZ], (2)

and

X → ⟨YZ⟩ (3)

Rule (1) is responsible for generating the word pairs, where e denotes the production of
the first input string and f denotes the production of the second input string. Rules (2)
and (3) are responsible for generating the syntactic subtree pairs: the square brackets
keep the two constituents in the same order in both languages, while the angle brackets
invert their order in the second language. ITG in a way eliminates the limitation of the
grammatical structure, covering a majority of the syntactic differences between languages.
However, ITG does not explicitly stipulate which order to apply, sometimes causing
ambiguity and lowering the translation performance.
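To make the straight and inverted rules concrete, the sketch below derives an aligned string pair from a toy ITG derivation tree; the mini-grammar, the word pairs, and the derive helper are illustrative assumptions, not part of Wu's formalism beyond the three rules above.

```python
# Minimal sketch of an ITG derivation (toy grammar assumed).
# Terminal rule X -> e/f emits a word pair; straight rule X -> [Y Z] keeps the
# order on both sides; inverted rule X -> <Y Z> swaps the order on the second side.

def derive(node):
    """Return (source_words, target_words) for a derivation tree."""
    kind = node[0]
    if kind == "term":                       # X -> e/f: emit a word pair
        _, e, f = node
        return [e], [f]
    _, left, right = node
    e1, f1 = derive(left)
    e2, f2 = derive(right)
    if kind == "straight":                   # X -> [Y Z]: same order on both sides
        return e1 + e2, f1 + f2
    return e1 + e2, f2 + f1                  # X -> <Y Z>: target order inverted

# English-French toy pair: "white house" <-> "maison blanche" (adjective inverted)
tree = ("inverted",
        ("term", "white", "blanche"),
        ("term", "house", "maison"))
src, tgt = derive(tree)
print(" ".join(src), "|", " ".join(tgt))     # white house | maison blanche
```

The inverted rule is what lets a single derivation cover an adjective-noun pair whose order differs between the two languages, without requiring the parallel ordering that a plain SCFG would impose.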
Tree to String Model: The tree-to-string model (Yamada and Knight 2001) is inspired by
the IBM alignment Models. Yamada and Knight noticed the criticism that the IBM Models
fail to model the structural or syntactic aspects of languages. The tree-to-string model
also involves the noisy channel model, but the channel receives a parse tree as input
(whereas the output forms are strings). The channel operates on each node of the parse
tree: reordering the child nodes, inserting extra words at nodes, and translating the
leaf words. Because of these operations, the tree-to-string model is capable of modeling
languages with linguistic differences (grammatical or syntactic aspects), including SVO
languages (English or Chinese) and SOV languages (Japanese or Turkish). Assume that we
need to translate English into Japanese; hence, an English parse tree is input to the
noisy channel model. Figure 11 depicts how this model works.

Fig. 10  A Simple Syntax Tree, Koehn, 2009, reproduced from the information in Koehn (2009a)

Firstly, after inputting the parse tree, the model stochastically reorders all child
nodes under each internal (non-terminal) node. If a node has N children, then N!
reorderings exist. In this example, node VB has three child nodes, PRP, VB1 and VB2, and
the child nodes may have their own child node(s). Next, the model stochastically inserts
an extra word at each node. The insertion may occur on the left or right-hand side of the
node, or not at all. Lastly, translation is applied to each leaf. Each operation has its
own probability, which is estimated using the EM Algorithm. The probabilities of the
operations are shown in the sub-tables of Table 2, where the r-table, n-table, and
t-table account for the probabilities of reordering, insertion, and translation,
respectively.
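The three channel operations can be sketched as one recursive pass over a toy parse tree. Here each probability table is collapsed to its most probable entry (echoing the r-, n-, and t-tables), and the specific tree, reordering, particle, and word translations below are illustrative assumptions rather than the paper's trained estimates.

```python
# Sketch of the tree-to-string channel operations: reorder, insert, translate.
# Each table keeps only its most probable entry (deterministic for illustration).

REORDER = {("PRP", "VB1", "VB2"): ["PRP", "VB2", "VB1"]}   # most probable reordering
TRANSLATE = {"he": "kare", "adores": "daisuki", "music": "ongaku"}
INSERT_RIGHT = {"PRP": "ha"}                               # particle inserted right of PRP

def channel(tree):
    """tree = (label, children) for internal nodes, (label, word) for leaves."""
    label, rest = tree
    if isinstance(rest, str):                    # leaf: translate the word
        words = [TRANSLATE.get(rest, rest)]
    else:
        labels = tuple(child[0] for child in rest)
        order = REORDER.get(labels, list(labels))    # reorder the children
        by_label = dict(zip(labels, rest))           # assumes unique child labels
        words = [w for lab in order for w in channel(by_label[lab])]
    if label in INSERT_RIGHT:                    # insert an extra word at this node
        words.append(INSERT_RIGHT[label])
    return words

tree = ("VB", [("PRP", "he"), ("VB1", "adores"), ("VB2", "music")])
print(" ".join(channel(tree)))                   # kare ha ongaku daisuki
```

Even this deterministic version shows why the model handles SVO-to-SOV pairs: the reordering moves the verb after its object, and the insertion step supplies function words (here the particle ha) that have no source-side counterpart.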
Dataset and Experimental Analysis: Yamada et al. in Yamada and Knight (2001) used a
Japanese-English dictionary to evaluate their approach, collecting a corpus of 2121
translation sentence pairs from the dictionary. They show a comparative analysis with IBM
Model 5; consequently, they also trained IBM Model 5 on the same collected corpus. In the
evaluation, the authors show that their model generates a better alignment score than IBM
Model 5 and achieves better results in counting perfectly aligned sentence pairs. Notably,
they performed the counting analysis on fifty (50) pairs of sentences. The comparative
analysis is shown in Table 3. We observe that Yamada's model outperforms IBM Model 5 in
both average alignment score and perfectly aligned sentence count.

2.4 Phrase‑based machine translation

In 2003, the introduction of phrase-based machine translation (Koehn et al. 2003)
significantly improved the performance of machine translation. Phrase-based MT went on to
become one of the mainstream MT approaches; trending products such as Google Translate13
adopted this approach (as stated earlier, Google moved to NMT in 2016). Instead of the
word in word-based MT or the syntactic unit in syntax-based MT, phrase-based MT
translates phrases (strings of words) as the fundamental unit. Hence, phrase-based MT can
be regarded as a complete upgrade of word-based MT. The phrase-based approach considers
arbitrary consecutive strings as phrases, automatically learning bilingual phrase pairs
from the word-aligned bilingual corpus. Figure 12 depicts the basic concept of
phrase-based machine translation.

Fig. 11  Channel Operations of Tree to String Model: Reorder, Insert, and Translate, Yamada and Knight, 2001, reproduced from the information in Yamada and Knight (2001)

We clearly see that the German phrase spass am is related to the English phrase fun with
the. In the word alignment models of the word-based approach, it is easy to omit the
English article the when the model tries to relate the German word am and the English
word the. The phrase-based approach, however, alleviates this defect. Therefore, the
phrase-based approach masters local context dependencies and also solves the issue of
word order within the phrase. What is more, compared to the syntax-based approach, the
phrase-based approach has wider applicability and occupies a smaller search space.

2.4.1 A phrase‑based model

It is established on the noisy channel model we mentioned in the former section:

e* = argmax_e p(e ∣ f) = argmax_e p(e) p(f ∣ e),

where f denotes the foreign language and e denotes the English translation. The input
foreign sentence f_1^I is segmented into a sequence of I phrases, which is equivalent to
f_1^I = f_1, ..., f_i, ..., f_I. Next, every single phrase f_i is translated into an
English phrase e_i. The model selects the phrases and composes them into the sentence
with the highest probability among all the candidate translated phrases. The probability
distribution of the translated phrase can be
13 Google Translate https://translate.google.com/.
Table 2  Probabilities of the three operations, Yamada and Knight, 2001, reproduced from the information in Yamada and Knight (2001)

r-table (reordering probabilities):

Original order PRP VB1 VB2:
  PRP VB1 VB2 0.074 | PRP VB2 VB1 0.723 | VB1 PRP VB2 0.061
  VB1 VB2 PRP 0.037 | VB2 PRP VB1 0.083 | VB2 VB1 PRP 0.021
Original order VB TO:  VB TO 0.251 | TO VB 0.749
Original order TO NN:  TO NN 0.107 | NN TO 0.893
⋮

n-table (insertion probabilities):

parent    Top    VB     VB     VB     TO     TO    ⋯
node      VB     VB     PRP    TO     TO     NN    ⋯
P(None)   0.735  0.687  0.344  0.709  0.900  0.800 ⋯
P(Left)   0.004  0.061  0.004  0.030  0.003  0.096 ⋯
P(Right)  0.260  0.252  0.652  0.261  0.007  0.104 ⋯

Inserted word w and P(ins-w): ha 0.219, ta 0.131, wo 0.099, no 0.094, ni 0.080, te 0.078, ga 0.062, ⋯, desu 0.0007, ⋮

t-table (translation probabilities; Japanese candidates per English word):

adores:    daisuki 1.000
he:        kare 0.952, NULL 0.016, nani 0.005, da 0.003, shi 0.003
i:         NULL 0.471, watasi 0.111, kare 0.055, shi 0.021, nani 0.020
listening: kiku 0.333, kii 0.333, mi 0.333
music:     ongaku 0.900, naru 0.100
to:        ni 0.216, NULL 0.204, to 0.133, no 0.046, wo 0.038
⋯
Table 3  Alignment average score and perfectly aligned sentence count. Yamada's model outperforms IBM Model 5 in both aspects, Yamada and Knight, 2001, reproduced from the information in Yamada and Knight (2001)

                    Alignment average score   Perfectly aligned sentence count
Syntax-based Model  0.582                     10
IBM Model 5         0.431                     0

Fig. 12  A phrase-based model, Koehn, 2009, reproduced from the information in Koehn (2009)

Table 4  Phrase translations for the German word natuerlich, Koehn, 2009, reproduced from the information in Koehn (2009)

Translation     Probability 𝜙(e|f)
of course       0.5
naturally       0.3
of course,      0.15
, of course,    0.05

Table 5  Phrase translations for the German phrase den Vorschlag, Koehn, 2009, reproduced from the information in Koehn (2009)

English           𝜙(e|f)    English           𝜙(e|f)
the proposal      0.6227    the suggestions   0.0114
's proposal       0.1068    the proposed      0.0114
a proposal        0.0341    the motion        0.0091
the idea          0.0250    the idea of       0.0091
this proposal     0.0227    the proposal,     0.0068
proposal          0.0205    its proposal      0.0068
of the proposal   0.0159    it                0.0068
the proposals     0.0159    .                 0.0068

denoted as 𝜙(f_i ∣ e_i) (Koehn et al. 2003). We can generate a phrase translation table
using the probabilities 𝜙(f_i ∣ e_i) (Tables 4 and 5).
We can clearly see that the model estimates the probabilities of all possible phrases,
including lexical variation, morphological variation, function words such as prepositions
and articles, and even punctuation marks. Therefore, the model is not limited to
linguistic phrases, which is a characteristic of phrase-based MT.

Fig. 13  Distance-Based Reordering, Koehn, 2009, reproduced from the information in Koehn (2009)

Fig. 14  English-to-Swedish Alignment Matrix, reproduced from the information in Sara Stymne (2022)
Distance-Based Reordering: The English output phrases are reordered to obtain the optimal
quality of translation. Reordering is governed by a probability distribution
d(a_i − b_{i−1}), the relative distortion, where a_i is the start position of the foreign
phrase that is translated into the i-th English phrase, and b_{i−1} is the end position
of the foreign phrase translated into the (i−1)-th English phrase (Koehn et al. 2003).
Figure 13 depicts an example of distance-based reordering.
According to the Bayes rule of the noisy channel model, the phrase translation
probability 𝜙, and the reordering probability d, the translation model p(f ∣ e) can be
decomposed into:

Fig. 15  Swedish-to-English Alignment Matrix, reproduced from the information in Sara Stymne (2022)

Fig. 16  Intersection of the Alignment Matrix, reproduced from the information in Sara Stymne (2022)
p(f_1^I ∣ e_1^I) = ∏_{i=1}^{I} 𝜙(f_i ∣ e_i) d(a_i − b_{i−1})
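Given a phrase table and a distortion model, this decomposition is a straightforward product. The sketch below scores one monotone segmentation; the phrase table entries and the exponential distortion penalty d(x) = α^|x−1| with α = 0.5 are illustrative assumptions, not values from Koehn et al. (2003).

```python
# Sketch: scoring one segmentation under the decomposition
# p(f|e) = prod_i phi(f_i|e_i) * d(a_i - b_{i-1}); toy phrase table assumed.

PHRASE_TABLE = {
    ("natuerlich", "of course"): 0.5,
    ("hat", "has"): 0.8,
    ("john", "john"): 0.9,
}

def distortion(start, prev_end, alpha=0.5):
    """d(a_i - b_{i-1}) = alpha^|a_i - b_{i-1} - 1|; equals 1.0 for monotone steps."""
    return alpha ** abs(start - prev_end - 1)

def translation_model(phrase_pairs):
    """phrase_pairs: list of (foreign phrase, english phrase, foreign start position)."""
    p, prev_end = 1.0, 0
    for f, e, start in phrase_pairs:
        p *= PHRASE_TABLE[(f, e)] * distortion(start, prev_end)
        prev_end = start + len(f.split()) - 1    # last covered foreign position
    return p

# A monotone translation: no reordering, so every distortion factor is 1.0
pairs = [("natuerlich", "of course", 1), ("hat", "has", 2), ("john", "john", 3)]
print(translation_model(pairs))                  # ≈ 0.36 (0.5 * 0.8 * 0.9)
```

Swapping two phrase positions in `pairs` would leave the 𝜙 factors unchanged but shrink the score through the distortion penalty, which is exactly how the model prefers near-monotone orderings.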

Phrase Pair Extraction: Phrase pair extraction has always been a challenge of
phrase-based MT. First, the words are aligned bidirectionally from the parallel corpus
using the IBM Models or the GIZA++ toolkit. Figures 14 and 15 depict the
English-to-Swedish and the Swedish-to-English alignment matrices, and Figs. 16 and 17
depict the intersection and union of these matrices, respectively.
By combining these two matrices, we obtain the intersection and union of the matrices. We
get a high-precision alignment of high-confidence alignment points from the intersection,
and a high-recall alignment with additional alignment points from the union (Koehn et al.
2003). An expansion heuristics approach that explores the space between intersection and
union optimizes the output of the word alignments (Fig. 18). The heuristic begins with
the intersection of the two word alignments and adds additional alignment points that are
in the union of the two word alignments (Och et al. 1999). Each additional alignment
point must relate to at least one unaligned word. Initially, the iteration starts from
the top right corner of the alignment matrix, expanding only to the directly adjacent
alignment points, until there are no adjacent alignment points available. Lastly,
non-adjacent alignment points are expanded, and each new alignment point must also relate
to at least one unaligned word.

Fig. 17  Union of the Alignment Matrix, reproduced from the information in Sara Stymne (2022)

Fig. 18  Expansion Heuristics of Word Alignments, reproduced from the information in Sara Stymne (2022)
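The expansion heuristic can be sketched directly on sets of alignment points. The simplified grow function below starts from the intersection and adds adjacent union points that touch at least one still-unaligned word; the toy directional alignments are illustrative assumptions, and real implementations also perform the final non-adjacent pass with specific scanning orders.

```python
# Simplified sketch of alignment symmetrization: begin with the intersection of
# the two directional word alignments, then grow with adjacent union points
# that touch at least one still-unaligned word.

def grow(e2f, f2e):
    inter, union = e2f & f2e, e2f | f2e
    alignment = set(inter)
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for i, j in sorted(alignment):
            for di, dj in neighbours:
                cand = (i + di, j + dj)
                if cand in union and cand not in alignment:
                    e_covered = any(a[0] == cand[0] for a in alignment)
                    f_covered = any(a[1] == cand[1] for a in alignment)
                    if not e_covered or not f_covered:   # touches an unaligned word
                        alignment.add(cand)
                        added = True
    return alignment

# Toy directional alignments over a 3-word sentence pair
e2f = {(0, 0), (1, 1)}
f2e = {(0, 0), (1, 1), (2, 2)}
print(sorted(grow(e2f, f2e)))   # [(0, 0), (1, 1), (2, 2)]
```

Here the point (2, 2) appears in only one direction, so it is outside the intersection, but it is adjacent to (1, 1) and covers two unaligned words, so the heuristic grows into it, recovering recall without admitting every union point.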
After alignment, phrase pairs that are consistent with the word alignment are extracted.
Note that all the words of a phrase pair need to align to each other, which means the
words and their corresponding alignment points must be inside the same phrase pair, not
outside of it. An example of phrase pair extraction is shown in Fig. 19.
When the extractions are done, probabilities are assigned to the phrase translations
using phrase pair scoring. Finally, a pair can be scored by relative frequency:

𝜙(f ∣ e) = count(f, e) / Σ_f count(f, e)

Fig. 19  Consistency of the Phrase Pair Extraction, reproduced from the information in Sara Stymne (2022)

Table 6  PBMT vs. WBMT (comparison is shown with respect to BLEU score). We observe that PBMT outperforms WBMT in each language pair, reproduced from the information in Koehn et al. (2003)

Language Pair    Phrase-based MT (PBMT)   IBM Model 4 (WBMT)
English-German   0.2361                   0.2040
French-English   0.3294                   0.2787
English-French   0.3145                   0.2555
Finnish-English  0.2742                   0.2178
Swedish-English  0.3459                   0.3137
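This relative-frequency estimate is a simple count-and-normalize pass. A minimal sketch, with toy extracted-pair counts as illustrative assumptions:

```python
from collections import Counter

# Sketch of phrase pair scoring by relative frequency,
# phi(f|e) = count(f, e) / sum over f' of count(f', e); toy pair counts assumed.

def score(pair_counts):
    totals = Counter()
    for (f, e), c in pair_counts.items():
        totals[e] += c                        # sum over all f for each e
    return {(f, e): c / totals[e] for (f, e), c in pair_counts.items()}

pair_counts = {
    ("natuerlich", "of course"): 10,
    ("selbstverstaendlich", "of course"): 5,
    ("natuerlich", "naturally"): 6,
}
phi = score(pair_counts)
print(phi[("natuerlich", "of course")])       # 10 / 15
```

Normalizing per English phrase e is what makes each column of the phrase translation table (as in Tables 4 and 5) a proper conditional distribution over foreign phrases.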

Datasets and Experimental Analysis: The authors Koehn et al. in Koehn et al. (2003) used
the publicly available Europarl14 corpus (Koehn 2005) to evaluate their approach. In their
evaluation, they showed a comparison between PBMT and WBMT. Their analysis proves that
PBMT outperforms WBMT by a large margin, as shown in Table 6.

2.5 Decoding

In statistical MT approaches, the translation model reflects the probabilities of the source
language translating to the target language. The language model reflects the rationalities on
grammatical rules, such as fluency. The decoder and its correspondence algorithms decode

14 Europarl Datasets https://www.statmt.org/europarl/.

Fig. 20  Translation Options for the Decoder, Koehn, 2009, reproduced from the information in Koehn
(2009b)

the input source language based on the established models and the estimated parameters,
and ultimately obtain the translation. Combined with the noisy channel model,
e = argmax_e p(e)p(f ∣ e), the task of the decoder is to find the best translation e with
the highest probability. If the search process does not find the optimal (or accurate
enough) translation, then the decoding is poor.
Initially, the decoder lists all the possible translations for each translation unit
(depending on the selected translation approach, such as word or phrase). Figure 20
depicts the translation options of a short German sentence. Since there are numerous
options, the decoder does not know the correct translations in advance. In general, there
are many different approaches or techniques for finding the best translation. In
particular, these approaches are known as "heuristic search" methods because they are
heuristic in nature and do not guarantee that the best translation will be found; they
aim to find a close-enough translation. Notably, one of the commonly used approaches is
heuristic beam search.

2.5.1 Heuristic beam search

The decoder that Koehn et al. introduced (Koehn et al. 2003) involves a beam search
algorithm, which can address this problem. The English output sentence is generated from
left to right in the form of partial translations or hypotheses. First, an initial empty
hypothesis is generated, which covers no input words and produces no output translation.
Next, new hypotheses are expanded from this initial hypothesis by choosing among all the
translation options. New hypotheses are generated from the created partial hypotheses
until the last unit is reached. Then, the decoder updates the probability cost of the
hypotheses and searches for the hypothesis with no untranslated foreign units that has
the highest probability (the cheapest hypothesis) for the ultimate output (Koehn et al.
2003). Figure 21 shows an example process.
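A drastically simplified version of this loop can illustrate the mechanics: hypotheses live in stacks indexed by the number of covered foreign words, each expansion consumes one translation option, and histogram pruning keeps only the best few hypotheses per stack. The translation options, the bigram "language model", and the monotone coverage below are toy assumptions; Koehn's decoder additionally reorders, recombines hypotheses, and adds a future cost estimate.

```python
import heapq
from math import log

# Toy translation options and bigram log-probabilities (illustrative assumptions)
OPTIONS = {                                       # foreign word -> [(english, log prob)]
    "das": [("the", log(0.7)), ("that", log(0.3))],
    "haus": [("house", log(0.8)), ("home", log(0.2))],
}
BIGRAM = {("<s>", "the"): log(0.6), ("<s>", "that"): log(0.4),
          ("the", "house"): log(0.7), ("the", "home"): log(0.3),
          ("that", "house"): log(0.2), ("that", "home"): log(0.2)}

def decode(source, beam_size=2):
    """Monotone stack decoding with histogram pruning (simplified sketch)."""
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0] = [(0.0, ["<s>"])]                  # the initial empty hypothesis
    for i, f in enumerate(source):
        expansions = []
        for score, words in stacks[i]:
            for e, tp in OPTIONS[f]:
                lm = BIGRAM.get((words[-1], e), log(0.01))
                expansions.append((score + tp + lm, words + [e]))
        stacks[i + 1] = heapq.nlargest(beam_size, expansions)   # histogram pruning
    _, best = max(stacks[-1])                     # best complete hypothesis
    return " ".join(best[1:])

print(decode(["das", "haus"]))                    # the house
```

Lowering `beam_size` to 1 turns the search greedy, which is exactly where the heuristic nature shows: a hypothesis that looks cheap early can crowd out the one that would have led to the best complete translation.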
Hypotheses Recombination: The hypotheses are stored in stacks. When hypothesis expansions
are applied, the expansions jump among the stacks (Fig. 22). The number of hypotheses
grows exponentially with the input sentence length (Koehn 2009).

Fig. 21  An example flow of Hypotheses Expansions and Best Path, Koehn, 2009, reproduced from the information in Koehn (2009b)

Fig. 22  Hypothesis Expansions in a Stack Decoder, Koehn, 2009, reproduced from the information in Koehn (2009b)

In computational complexity terms, MT decoding is NP-complete (Papadimitriou 2003). For
this reason, it is necessary to reduce the search space of the decoding algorithm, using
techniques such as hypothesis recombination and pruning (Koehn 2009). Compared to the
risky pruning approach, hypothesis recombination is risk-free. In hypothesis
recombination (Och et al. 2001), if any two hypothesis paths lead to two matching
hypotheses that cover the same number of translated foreign words and produce the same
English words in the output, then the hypothesis with the worse score is dropped (Koehn
2009). Furthermore,
if any two hypotheses cannot be distinguished by the translation model state and the
language model state, then these two hypotheses are recombined and the one with the worse
score is dropped, because apart from the cheapest hypothesis, the others cannot be part
of the best translation. Hypothesis recombination reduces the number of hypotheses stored
in each stack, but it is not enough on its own.
Pruning and Future Cost Estimate: Pruning removes bad hypotheses early. It compares the
hypotheses that cover the same number of words of the source language (input words) and
prunes out the weak hypotheses based on their cost. However, it is unreasonable to
compare merely the current probability costs of the hypotheses, because the search
process tends to translate the easier parts of the sentence first. For example, the
four-word English sentence "How old are you?" may be translated into Chinese as (in
English meaning) "Why always you?", because in Chinese the characters of this sentence
appear more commonly than the characters of the accurate translation; hence, the
probability cost of these characters is cheaper than that of the accurate one.
To alleviate this problem, the pruning strategy must consider both the current
probability cost and the future probability cost. Intuitively, the future cost estimate
assesses how expensive the translation of the rest of the sentence will be and tries to
minimize that cost (choosing the cheapest translation options). Note that pruning is
risky: the optimal translation option can be pruned out if defects occur in the process
of the future cost estimate.

2.6 WBMT vs. SBMT vs. PBMT

From the comparative analyses shown in Tables 3 and 6, we can say that the performance of
WBMT cannot match SBMT and PBMT. Now, the goal is to find the winner between SBMT and
PBMT. We observe that SBMT involves many rules and that it is necessary to filter those
rules in use. In addition, it needs good parsers, yet the performance of parsers is often
poor. Besides, it has complex syntax and is slow in training and decoding. Compared with
these issues, PBMT works better. Notably, the authors in Kaljahi et al. (2012) present a
comparative analysis between SBMT and PBMT and validate that PBMT outperforms SBMT by a
fair margin, as shown in the comparative analysis in Table 7. Also, the authors in
Zollmann et al. (2008) show that PBMT performs better than SBMT.
Notably, this analysis is carried out to determine the standard SMT approach that the
community previously used as the mainstream MT system, and to enable the researcher and
developer communities to carry out comparative analyses against the NMT approaches.

3 Neural machine translation

The establishment of NMT is a breakthrough for the whole MT industry. In recent years,
the rise of NMT has been driven by the developments of neural networks (NN) and deep
learning. In 1957, Frank Rosenblatt of the Cornell Aeronautical Laboratory introduced the
Perceptron algorithm (Kussul et al. 2001), the simplest single-layer neural network.
However, the perceptron was unable to solve complicated problems, which drastically
restricted its advancement (development). For this reason, research on NN stagnated for
more than 20 years (Li et al. 2018).
After the 1980s, public interest in NN was aroused again, because the back-propagation
algorithm (BP) (Hecht-Nielsen 1992) was brought into the multilayer perceptron (MLP)
(Pal and Mitra 1992), which is also called the feed-forward neural network (FNN).
Nowadays, with the rapid enhancement of computational performance and the wide
application of the graphics processing unit (GPU), NN achieves significant success in
many areas, including MT. NMT mostly adopts the Recurrent Neural Network (RNN) and the
Transformer Neural Network (TNN).
In an RNN-based model, an NMT system only requires a single model and language pairs (as
sentences) to implement the translation process based on NN. The models in NMT systems
follow an encoder-decoder architecture (Sutskever et al. 2014), which is called
end-to-end translation (meaning that a single model implements the whole translation
process). The encoder-decoder model jointly calculates the probabilities of the
translation of the given bitext. The sentences of each language are passed through an
encoder and a decoder: the encoder NN encodes the sentences into a fixed-length vector,
and the translation is generated as the output of the decoder NN (Cho et al. 2014a).
Although the encoder-decoder architecture is effective, it still faces problems when
coping with long sentences, as it is difficult for the NN in this architecture to
compress as much useful information of the source sentence as possible into the
fixed-length vector. As the sentence length and the number of unknown words increase,
the performance of the translation declines dramatically (Cho et al. 2014b). To remedy
these defects, the attention-based recurrent neural network architecture for the
encoder-decoder (Bahdanau et al. 2014; Luong et al. 2015) was introduced, which encodes
the input sentence into a sequence of vectors rather than a single fixed-length vector,
making NMT a state-of-the-art MT approach that carries out translation considerably well.
On the other hand, the Transformer adopts the attention mechanism alone, which helps to
simplify the training process and reduce the computational complexity of the modelling.

3.1 RNN encoder‑decoder in NMT

In its earlier versions, NMT uses an encoder-decoder with RNNs to implement the sequence-to-sequence transfer. Before diving into the RNN encoder-decoder NMT mechanism, we first present a human analogy for making decisions or performing tasks. As we know, human thoughts do not arise without foundations. On the contrary, humans usually think based on previous experiences or knowledge that lies, durably, in the mind. In other words, when representing something, we human beings transfer that instance to another object and master the sequence transduction: we can readily envisage a mental depiction of a sequence. For example, if someone says, "I like to play with my husky in snowy weather", we envisage a husky playing in the snow. We see the image of the husky in our minds, even though we might never have seen that husky. In language translation, our (human) transduction takes a sentence of a source language, builds a cognitive representation of its meaning, and then transforms that representation into an interpretation of the sentence in the target language. Traditional types of NN do not work this way; they cannot use previous representations to predict the next move. This is a huge limitation of NN - the network has to learn the
transduction from the given sequence with numerical representations. Luckily, the recurrent neural network (RNN) solves this problem, although it fails to achieve a significant accuracy level on its own. RNN possesses "memory cells" and has backward (recurrent) connections within the hidden layer, as shown in Fig. 23. RNN is perfectly capable of handling sequential data and can perform arbitrary parallel computations (Sutskever et al. 2014).

Table 7  SBMT vs. PBMT

         En-Fr                                       En-De
     BLEU    NIST   TER     GTM     METEOR      BLEU    NIST  TER     GTM     METEOR

PB   0.6140  10.73  0.3584  0.6357  0.7436      0.5099  9.48  0.4911  0.5441  0.6264
HP   0.6188  10.78  0.3535  0.6400  0.7457      0.5289  9.68  0.4676  0.5592  0.6408
TS   0.5919  10.39  0.3719  0.6194  0.7284      0.4939  9.19  0.4923  0.5349  0.6146
ST   0.6013  10.53  0.3631  0.6258  0.7334      0.5086  9.43  0.4753  0.5479  0.6265
TT   0.5783  10.25  0.3842  0.6096  0.7168      0.4784  9.03  0.5059  0.5219  0.6051

The analysis is performed with variants of SBMT and PBMT: PB (standard phrase-based system), HP (hierarchical phrase-based system), TS (tree-to-string syntax-based system), ST (string-to-tree syntax-based system), and TT (tree-to-tree syntax-based system). The evaluation covers English-French and English-German translation with several evaluation methods, namely BLEU, NIST, GTM, METEOR, and TER, where higher is better for BLEU, NIST, GTM, and METEOR, and lower is better for TER. The analysis validates that PBMT outperforms SBMT by a fair margin. Kaljahi et al. (2012)
Given a sequence of inputs (x1, …, xT), the hidden state h is calculated by the RNN at each time step t using the iterative equation:
h⟨t⟩ = f (h⟨t−1⟩, xt),

where f is a non-linear activation function (Sharma et al. 2017). The RNN yields a probability distribution over the sequence by predicting the next symbol at each step. Ultimately, the probability of a sequence x is computed as the product of the conditional distributions of the output at each timestep t: p(x) = ∏_{t=1}^{T} p(xt ∣ xt−1, …, x1) (Cho et al. 2014a). Principally, the idea is to map the input sequence into a fixed-length vector with one RNN (the encoder) and to map that fixed-length vector representation to the target sequence with a second RNN (the decoder). This model is introduced by Cho et al. (2014a) and is shown in Fig. 24.
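The recurrence above can be sketched in a few lines of NumPy. The following toy example, with tanh standing in for the non-linear activation f and purely illustrative sizes and random weights (the names Wh, Wx, Wy are our own, not from the cited papers), unrolls the hidden-state update h⟨t⟩ = f(h⟨t−1⟩, xt) and accumulates the factorized sequence probability:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, V = 8, 4, 10          # hidden size, input size, vocabulary size (illustrative)
Wh = rng.normal(0, 0.1, (H, H))
Wx = rng.normal(0, 0.1, (H, X))
Wy = rng.normal(0, 0.1, (V, H))

def rnn_step(h_prev, x_t):
    """h_t = f(h_{t-1}, x_t) with f = tanh, a classic non-linear activation."""
    return np.tanh(Wh @ h_prev + Wx @ x_t)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Unroll over a toy input sequence and accumulate the factorized
# sequence probability p(x) = prod_t p(x_t | x_{t-1}, ..., x_1).
xs = rng.normal(size=(5, X))            # a sequence of 5 input vectors
symbols = [3, 1, 4, 1, 5]               # toy "next symbol" targets
h = np.zeros(H)
log_p = 0.0
for x_t, sym in zip(xs, symbols):
    h = rnn_step(h, x_t)
    dist = softmax(Wy @ h)              # distribution over the next symbol
    log_p += np.log(dist[sym])
p_x = np.exp(log_p)
```

Because each step reuses the previous hidden state, the network carries a running summary of everything it has read so far, which is exactly the "memory" property the analogy above motivates.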

3.1.1 Cho’s model

Cho’s model (Cho et al. 2014a) aims to learn the conditional probability distribution
p(y1, …, yT′ ∣ x1, …, xT) of two variable-length sequences x and y (refer to Fig. 24); the sequence lengths T and T′ may differ. The encoder RNN reads the input sequence x in order, and its hidden state h changes accordingly. When the encoding process is over, the last hidden state of the encoder, denoted by c (shown in Fig. 24), carries the summary of the whole input sequence. On the other side, the decoder RNN is trained to generate the output sequence by predicting the next symbol yt based on the hidden state h⟨t⟩. The hidden state is updated as follows:
h⟨t⟩ = f (h⟨t−1⟩ , yt−1 , c).

where both yt and h⟨t⟩ are conditioned on yt−1 and the summary c of the input sequence. Likewise, the conditional distribution of the next symbol is:


Fig. 23  Recurrent Neural Network

Fig. 24  Cho's RNN Encoder-Decoder Model, reproduced from the information in Cho et al. (2014a)

P(yt ∣ yt−1 , yt−2 , … , y1 , c) = g(h⟨t⟩ , yt−1 , c),

where f and g are activation functions (Sharma et al. 2017). After training, we can use this RNN Encoder-Decoder model to generate a target sequence from an input sequence. Besides, the conditional score of a given language pair can also be obtained.
We observe that Cho et al. (2014a) introduce in their model a hidden activation function called the Gated Recurrent Unit (GRU), which is inspired by the Long Short-Term Memory (LSTM) (Colah 2015; Graves 2012; Staudemeyer and Morris 2019; Wiki 2019b). They propose to replace the hidden unit of the basic RNN with the GRU. As we know, the basic RNN encounters the issue of forgetting past input sequences (the long-term dependency problem), as information is lost at each step going through the RNN (Colah 2015). LSTM, GRU, and some other variations of RNN address this issue. Notably, LSTM is the most commonly and popularly used, and it has long-term and short-term memory to remember past input sequences (Colah 2015; Graves 2012; Staudemeyer and Morris 2019; Wiki 2019b). However, the GRU is computationally much simpler than the LSTM (Cho et al. 2014a), while still being capable of capturing both long-term and short-term histories.

Table 8  BLEU scores on the development and test sets between the baseline (PBMT) and the RNN Encoder-Decoder model, reproduced from the information in Cho et al. (2014a)

Models      BLEU (dev)   BLEU (test)

Baseline    30.64        33.30
RNN         31.20        33.87
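The GRU update can be sketched as a minimal NumPy illustration of the gating equations (update gate z, reset gate r, candidate state h̃), following the formulation in Cho et al. (2014a); the weight names and sizes below are our own illustrative choices, not taken from any implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
H, X = 6, 4                 # hidden size and input size (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (input, recurrent) weight pair per gate, as in Cho et al. (2014a).
Wz, Uz = rng.normal(0, 0.1, (H, X)), rng.normal(0, 0.1, (H, H))
Wr, Ur = rng.normal(0, 0.1, (H, X)), rng.normal(0, 0.1, (H, H))
W,  U  = rng.normal(0, 0.1, (H, X)), rng.normal(0, 0.1, (H, H))

def gru_step(h_prev, x_t):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))    # candidate state
    return z * h_prev + (1.0 - z) * h_tilde          # interpolate old and new

h = np.zeros(H)
for x_t in rng.normal(size=(7, X)):
    h = gru_step(h, x_t)
```

Because the new state is a convex combination of the previous state and the candidate, the GRU can either carry history forward (z close to 1) or overwrite it (z close to 0), which is how it captures both long-term and short-term dependencies with fewer parameters than the LSTM.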
Datasets and Experimental Analysis: Cho et al. (2014a) perform the evaluation of their approach on the English-French translation task datasets of the WMT’14 workshop (ACL 2022a). Notably, abundant resources are readily available for building an English-French translation system in the framework of the WMT’14 translation task. The bilingual corpora include Europarl v7 (628 MB in size), the Common Crawl corpus (876 MB), the UN corpus (2.3 GB), News Commentary (77 MB), and the 10⁹ French-English corpus (2.3 GB).
Cho et al. (2014a) perform a comparative analysis with the phrase-based SMT system (considered the baseline in their study), as shown in Table 8. We observe that the RNN Encoder-Decoder model achieves a better BLEU score than the baseline model.
Summary: Cho's model (Cho et al. 2014a) is a composition of the SEQ2SEQ framework with the GRU as the hidden activation unit.
Cho's model achieves a better BLEU score than the baseline PBMT (Google SMT) model. However, it encounters some issues; for example, translation quality degrades as the sequence length and the number of unknown words increase (Cho et al. 2014b). Notably, Sutskever's model (Sutskever et al. 2014) mitigates some of these issues.

3.1.2 Sutskever’s model

In this model, Sutskever et al. (2014) describe a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure. This model also involves an RNN Encoder-Decoder architecture as in Cho's model (Cho et al. 2014a), where the encoder reads the whole input sequence, one timestep at a time, to map it into a fixed-length vector. Next, the decoder extracts the output sequence from this internal vector representation until reaching the end of the sequence. Figure 25 depicts Sutskever's model, where the "EOS" token denotes the end of the sequence; it enables the model to handle sequences of all possible lengths.
The model receives "ABC", "EOS" as the input sequence and encodes it with the encoder, then the decoder generates "WXYZ", "EOS" as the output sequence. Once the "EOS" token is output, the decoder stops making predictions.

15 WMT’14 Translation Task: https://www.statmt.org/wmt14/translation-task.html


Fig. 25  Sutskever's Model, reproduced from the information in Sutskever et al. (2014)

Notably, Sutskever's model is similar to Cho's model (Cho et al. 2014a). However, unlike Cho's model, it does not adopt GRU networks as the RNN unit. Instead, deep (4-layer) LSTM networks are used in Sutskever's model to solve the typical sequence-to-sequence problem.
In Sutskever's model, the LSTM networks learn to map an input of variable length into a fixed-dimensional vector representation, making this model aware of word order and considerably robust (Sutskever et al. 2014). Given the input sequence (x1, …, xT) and the output sequence (y1, …, yT′), the conditional probability distribution p(y1, …, yT′ ∣ x1, …, xT) is estimated by the model with LSTM networks. As in Cho's model, the lengths of the input and output sequences may differ.
Datasets and Experimental Analysis: Like Cho et al. (2014a), Sutskever et al. perform the evaluation of their approach on the English-French translation task datasets of the WMT’14 workshop (ACL 2022a). Notably, to find the best translation in decoding, they use beam search, as in SMT decoding.
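Beam search itself is model-agnostic: at every step it expands each partial hypothesis with its candidate next symbols and keeps only the top-scoring beam_size hypotheses. The sketch below uses a hypothetical toy scoring function standing in for the decoder; it illustrates the procedure only and is not the implementation used by Sutskever et al.:

```python
import numpy as np

def beam_search(step_fn, start, eos, beam_size=3, max_len=10):
    """Generic beam search: step_fn(seq) returns (symbols, log_probs) for the
    next position; keeps the beam_size highest-scoring partial sequences."""
    beams = [(0.0, [start])]                      # (cumulative log-prob, sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:
                finished.append((score, seq))     # hypothesis already complete
                continue
            symbols, log_probs = step_fn(seq)
            for sym, lp in zip(symbols, log_probs):
                candidates.append((score + lp, seq + [sym]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]            # prune to the beam width
    finished.extend(b for b in beams if b[1][-1] == eos)
    return max(finished or beams, key=lambda c: c[0])

# Toy "decoder": prefers symbol (last + 1), and emits EOS (= 0) after 3 steps.
def toy_step(seq):
    if len(seq) >= 4:
        return [0], [np.log(0.9)]
    nxt = seq[-1] + 1
    return [nxt, nxt + 1], [np.log(0.7), np.log(0.3)]

best_score, best_seq = beam_search(toy_step, start=1, eos=0, beam_size=2)
```

A greedy decoder (beam size 1) commits to the single best symbol at every step; widening the beam lets locally weaker choices survive if they lead to a globally better sentence, which is why Sutskever et al. report gains from larger beams.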
They perform a comparative analysis with the phrase-based SMT system (the baseline model), other existing models, and different combinations of their proposed method, as shown in Table 9. The authors carried out the analysis with different ensembles of LSTMs and demonstrated that the deep stacked LSTM model outperforms the baseline (i.e., phrase-based SMT) by a sizeable margin.
Summary: Sutskever’s Model is a composition of SEQ2SEQ framework with deep
stacked LSTM, plus decoding with Beam search.
Sutskever's model optimizes translation quality compared with other models. However, it encounters some issues with sentence length. Notably, it is hard for an RNN to handle input and output sentences of different lengths, a significant limitation of both Sutskever's model and Cho's model. The only information the decoder receives from the encoder is a single vector representation, the last hidden state of the encoder, which contains a numerical summary of the input sequence. A decoder can easily operate on a short input sequence, but extracting information from a long input sequence leads to a forgetting issue for the decoder, and the model's performance drops dramatically, even when adopting LSTM or GRU modules (Cho et al. 2014b; Karim 2019). To address these issues, an encoder-decoder with an attention-based mechanism was introduced, which encodes the input sentence into a sequence of vectors rather than a single fixed-length vector, thereby making NMT a state-of-the-art MT approach (Bahdanau et al. 2014; Luong et al. 2015).


Table 9  A comparative analysis of the LSTM on the WMT’14 English-French dataset, reproduced from the information in Sutskever et al. (2014)

Method                                                    BLEU

Bahdanau et al. (2014)                                    28.45
Baseline System (Phrase-based SMT) (Cho et al. 2014a)     33.30
Single forward LSTM, beam size 12                         26.17
Single reversed LSTM, beam size 12                        30.59
Ensemble of 5 reversed LSTMs, beam size 1                 33.00
Ensemble of 2 reversed LSTMs, beam size 12                33.27
Ensemble of 5 reversed LSTMs, beam size 2                 34.50
Ensemble of 5 reversed LSTMs, beam size 12                34.81

The best LSTM ensembles outperform the baseline model

3.2 RNN encoder‑decoder with attention mechanism in NMT

The NMT with attention mechanism was introduced by Bahdanau et al. (2014) and adopts the sequence-to-sequence framework. The attention mechanism enables the NMT model to access and retrieve the memory (the source input sequence) when needed (while learning), keeping the input and output sequences aligned, similar to the SMT models. Furthermore, NMT with attention also works like a human translator, who translates while checking the source text rather than memorizing the whole source text before translating.
Notably, in the attention-based NMT, the attention network is integrated with the RNN
Encoder-Decoder architecture, as shown in Fig. 26. However, there are differences between
the NMT with attention and its traditional counterparts. Firstly, the attention mechanism provides the decoder with information from every single hidden state of the encoder, rather than only the last hidden state, which in the traditional NMT models has to summarize the whole input sequence. For this reason, the alignment is learned by selecting the effective parts of the input sequence (Karim 2019; Meng et al. 2016). Secondly, the decoder involves a scoring mechanism over every hidden state of the encoder, where each hidden state is associated with a certain word in the input sequence; the hidden state with the highest score is amplified. Initially, we obtain the score of every hidden state of the encoder from an alignment model. These scores denote the attention weights for the decoder. The alignment model can be parameterized and jointly trained with the other components of the whole system (Bahdanau et al. 2014). Next, after obtaining the scores, a softmax function (Bouchard 2007) normalizes them into output probabilities for each word, yielding a softmax score. Then, each softmax score is multiplied with its corresponding hidden state to generate the alignment (or annotation) vectors. After that, the alignment vectors are summed up to generate the context vector, which aggregates the information of all alignment vectors from the previous step. Finally, the context vector is passed to the decoder, whose current hidden state then generates the target word based on the information provided by the context vector.
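The scoring, softmax, and weighted-sum steps described above can be sketched directly. The example below uses a simple dot-product alignment score for brevity (Bahdanau et al. (2014) instead train a small feed-forward alignment network), with purely illustrative random states:

```python
import numpy as np

rng = np.random.default_rng(2)
T, H = 5, 8                          # source length and hidden size (illustrative)
enc_states = rng.normal(size=(T, H)) # one encoder hidden state per source word
dec_state = rng.normal(size=H)       # current decoder hidden state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# 1. Score every encoder hidden state against the decoder state.
scores = enc_states @ dec_state

# 2. Normalize the scores with a softmax to obtain the attention weights.
weights = softmax(scores)

# 3. Weight each encoder state and sum them up into the context vector,
#    which the decoder then uses to generate the next target word.
context = (weights[:, None] * enc_states).sum(axis=0)
```

The highest-scoring source positions dominate the weighted sum, which is precisely the "amplification" of the most relevant hidden state described above.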


Fig. 26  The process of the attention mechanism (image adopted based on Bahdanau et al. (2014)), Meng et al. (2016)

3.2.1 Bahdanau’s model

Bahdanau et al. (2014) are the first to implement the attention mechanism in NMT. They introduce an NMT method using Cho's sequence-to-sequence model (Cho et al. 2014b) by jointly learning to align and translate. Figure 27 depicts an alignment outcome with an English source sentence and a French target sentence. We can clearly see that the attention model successfully aligns the two languages and also reflects some syntactic adjustments, showing exciting performance.
The encoder of Bahdanau's model adopts a bidirectional RNN (Li et al. 2019; Liu et al. 2021; Schuster and Paliwal 1997), which consists of a forward RNN and a backward RNN. Given an input sequence (x1, …, xT), the forward RNN reads it in its normal order and calculates the forward hidden state sequence (→h1, …, →hT). On the contrary, the backward RNN reads the input sequence in reverse order and calculates the backward hidden state sequence (←h1, …, ←hT). The annotation hj of each word xj is then obtained by concatenating the forward and backward hidden states. Furthermore, the decoder uses this annotation sequence to generate the context vector and the alignment. The benefit of this approach is that the annotation of each word includes the context around the word itself. What is more, the decoder of Bahdanau's model executes the attention mechanism.
Figure 28 depicts the proposed model attempting to generate the t-th target word yt. The encoder is a bidirectional RNN, and at,1 to at,T are the alignment weights in the attention layer. The decoder is a different RNN, which receives input from its previous state and the dynamic context vector ct. Hence, the encoder distributes the information it carries over the annotations,


Fig. 27  A sample alignment of the source sentence (English) and the target sentence (French), reproduced from the information in Bahdanau et al. (2014)

so that it preserves as much information as it can and the decoder can decide which vector it
should pay attention to.
Datasets and Experimental Analysis: Like Cho et al. (2014a) and Sutskever et al. (2014), Bahdanau et al. (2014) perform the evaluation of their approach on the English-French translation task datasets of the WMT’14 workshop (ACL 2022a). To find the best translation in decoding, they use beam search, as in SMT decoding and in Sutskever's model (Sutskever et al. 2014). Notably, the authors evaluate two types of models: Cho's model (the RNN Encoder-Decoder model, referred to as RNNencdec) (Cho et al. 2014a) and their proposed model, referred to as RNNsearch (Bahdanau et al. 2014). The RNNsearch outperforms the RNNencdec, as shown in Table 10 and in Fig. 29.
Summary: Bahdanau's model is a composition of the SEQ2SEQ framework with a bidirectional GRU (BiGRU) as the hidden activation unit, integrated with the attention mechanism, plus decoding with beam search.
Bahdanau's model achieves better results than Cho's model (Cho et al. 2014a). However, Bahdanau's model is complex in nature. To this end, Luong et al. (2015) introduce a simplified and generalized architecture based on Bahdanau's model.

3.2.2 Luong’s model

Luong et al. (2015) simplify the attention mechanism based on Bahdanau’s model (Bah-
danau et al. 2014), with a simpler computation path. They put forward two different kinds


Fig. 28  The proposed model attempting to generate the t-th target word yt, reproduced from the information in Bahdanau et al. (2014)

Table 10  Comparison of BLEU scores between Cho's model (the RNN Encoder-Decoder model, referred to as RNNencdec) (Cho et al. 2014a) and Bahdanau's proposed model (referred to as RNNsearch) (Bahdanau et al. 2014), where the trailing number denotes the maximum sentence length the model was trained on

Model            All     No UNK

RNNencdec-30     13.93   24.19
RNNsearch-30     21.50   31.44
RNNencdec-50     17.82   26.71
RNNsearch-50     26.75   34.16
RNNsearch-50∗    28.45   36.15
Moses            33.30   35.63

Notably, the evaluation is carried out separately on all sentences and on the sentences having no unknown (No UNK) words. We see that the RNNsearch outperforms the RNNencdec, reproduced from the information in Bahdanau et al. (2014)

of attention models: global attention (Bahdanau's model can be regarded as a simplified global attention model) and local attention. With the global model, attention is imposed on all the source words, whereas with the local model, attention is imposed only on a small subset of the source words.
In Luong's model, both the global and local attention variants adopt 4-layer stacked LSTM networks. While decoding, both attention models receive the top hidden state of the LSTM stack as input and derive the context vector ct, which is similar to the previous

Fig. 29  Comparison of BLEU scores on sentences with rare/unknown (UNK) words. The authors train two types of models, the RNN Encoder-Decoder (RNNencdec, Cho's model, Cho et al. 2014a) and their proposed model, RNNsearch (Bahdanau et al. 2014), where the trailing number denotes the maximum training sentence length. The RNNsearch outperforms the RNNencdec (Bahdanau et al. 2014)

Fig. 30  Global Attention (Luong et al. 2015)

Bahdanau's model. Notably, a concatenation layer is then employed to combine the top hidden state ht and the context vector ct. The concatenation layer outputs an attentional hidden state h̃t, which is passed through a softmax layer to obtain the predictive distribution.
We find that the global and local attention models mainly differ in how they derive the context vector, as shown in Figs. 30 and 31. In global attention, the attention layer receives all of the

Fig. 31  Local Attention (Luong et al. 2015)

Table 11  The authors show the comparative analysis, ensembling different combinations of models

System                                                                                BLEU

1. Winning WMT’14 system - phrase-based + large LM (Buck et al. 2014)                 20.7
2. Jean's NMT systems (Jean et al. 2015)
   RNNsearch (Jean et al. 2015)                                                       16.5
   RNNsearch + unk replace (Jean et al. 2015)                                         19.0
   RNNsearch + unk replace + large vocab + ensemble 8 models (Jean et al. 2015)       21.6
3. Luong's NMT systems (Luong et al. 2015)
   Base                                                                               11.3
   Base + reverse                                                                     12.6 (+1.3)
   Base + reverse + dropout                                                           14.0 (+1.4)
   Base + reverse + dropout + global attention (location)                             16.8 (+2.8)
   Base + reverse + dropout + global attention (location) + feed input                18.1 (+1.3)
   Base + reverse + dropout + local-p attention (general) + feed input                19.0 (+0.9)
   Base + reverse + dropout + local-p attention (general) + feed input + unk replace  20.9 (+1.9)
   Ensemble 8 models + unk replace                                                    23.0 (+2.1)

Here, the top block is the non-attentional phrase-based model, the middle block is Jean's model (Jean et al. 2015), built on top of Bahdanau's model (Bahdanau et al. 2014), and the bottom block is the ensembles of Luong's model (Luong et al. 2015). We observe that the attention models outperform the non-attention model, and Luong's ensemble model wins the race by a fair margin, reproduced from the information in Luong et al. (2015)

hidden states as input and produces the context vector, where the alignment vector at is derived by comparing the current target hidden state ht with all the source hidden states h̃s, as at(s) = align(ht, h̃s) (Tixier 2018). The alignment vector at is regarded as the weights. Given the weights at, the context vector ct is the weighted average over h̃s. Figure 30 depicts global attention.


On the other hand, local attention avoids focusing on all source words, which provides a computationally cheaper alternative to global attention and performs better with long sequences. The local attention mechanism focuses only on a small window of context. Firstly, for every word at time t, the model predicts an aligned position pt. Given the aligned position pt, the context vector ct is the weighted average over h̃s within the window [pt − D, pt + D], where D is chosen empirically. Figure 31 depicts local attention.
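The difference between the two variants can be sketched in a minimal NumPy illustration, with dot-product scores and purely illustrative random states, where the local variant scores only the window [pt − D, pt + D] around a given aligned position:

```python
import numpy as np

rng = np.random.default_rng(3)
S, H = 12, 6                           # source length, hidden size (illustrative)
h_s = rng.normal(size=(S, H))          # source hidden states
h_t = rng.normal(size=H)               # current target hidden state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def global_context(h_t, h_s):
    """Global attention: weights computed over all S source states."""
    a = softmax(h_s @ h_t)
    return a @ h_s

def local_context(h_t, h_s, p_t, D=2):
    """Local attention: weights only over the window [p_t - D, p_t + D]."""
    lo, hi = max(0, p_t - D), min(len(h_s), p_t + D + 1)
    a = softmax(h_s[lo:hi] @ h_t)
    return a @ h_s[lo:hi]

c_global = global_context(h_t, h_s)
c_local = local_context(h_t, h_s, p_t=5, D=2)   # only 5 of 12 states are scored
```

For a source of length S, the global variant scores S states per target word while the local variant scores only 2D + 1, which is where the computational saving for long sequences comes from.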
Interlude: Hard Attention and Soft Attention: The hard and soft attention models proposed by Xu et al. (2015) focus on image processing. Soft attention distributes weights to all areas of the source image, with greater computational cost; hard attention, on the contrary, only pays attention to parts of the source image, so that fewer computing resources are consumed. Therefore, we can say that the attention mechanism plays a significant role not only in MT but also in image captioning, i.e., adding captions to describe an image based on the attention model, as mentioned in Xu et al. (2015). Besides, in entailment reasoning, given premise scenarios and hypotheses in English, the attention mechanism helps determine the relationship between the premises and the hypotheses (contradiction, relation, or entailment) (Rocktäschel et al. 2015); it is also used in speech recognition, etc. Now, we come back to the context of this paper. Hard attention can be regarded as a local approach.
Datasets and Experimental Analysis: Luong et al. (2015) perform the evaluation of their approach on the English-German translation task datasets of the WMT’14 (ACL 2022b) and WMT’15 (ACL 2022b) workshops. Notably, the authors perform a comparative analysis, ensembling different combinations, as shown in Table 11. The top block is the non-attentional phrase-based model, the middle block is Jean's model (Jean et al. 2015), built on top of Bahdanau's model (Bahdanau et al. 2014), and the bottom block is the ensembles of Luong's model (Luong et al. 2015). We observe that the attention models outperform the non-attention model, and Luong's ensemble model wins the race by a fair margin.
Summary: Luong's model is a composition of the SEQ2SEQ framework with deep stacked LSTMs having both global and local attention mechanisms.
Luong's model achieves better results than other non-attention systems. While the translation quality is promising, it cannot match the performance of PBMT in production.
Further, the family of the attention mechanism also includes Self-Attention (Yang et al. 2019) and many variations, such as Hierarchical Attention (Yang et al. 2016), Attention over Attention (Cui et al. 2016), Multi-step Attention (Gehring et al. 2017), Multi-dimensional Attention (Wang et al. 2017), and others. These studies and the corresponding experiments make the attention mechanism more and more comprehensive.

3.2.3 Google’s neural machine translation (GNMT)

Google's Neural Machine Translation (GNMT) was introduced by Wu et al. (2016) in 2016; it adopts the sequence-to-sequence framework with an encoder-decoder and the attention mechanism. As stated earlier, Google migrated the machine translation engine of "Google Translate" to NMT in 2016. Hence, demonstrating Google's NMT
16 WMT’15 Translation Task: https://www.statmt.org/wmt15/translation-task.html


model is imperative. Besides, most of us use Google Translate in our daily life, so it is worthwhile to know its underlying architecture and working principles. Most importantly, Wu et al. (2016) introduced GNMT to address some existing issues of NMT models and to make the approach more robust. Some of the issues of NMT systems are computationally expensive training and translation, weak robustness (especially on sentences containing rare words), poor translation quality and accuracy (especially for low-resource languages), low processing speed, and many others. GNMT steps in and tries to minimize these issues.
Architecture: As stated earlier, GNMT adopts the sequence-to-sequence framework consisting of an encoder-decoder with the attention mechanism. The architecture of GNMT is shown in Fig. 32. The encoder is a deep stacked network of eight LSTM layers; the first LSTM layer is bidirectional, and there are residual connections (starting from layer 3) between the outputs of consecutive layers. The decoder module is also a deep stack of eight LSTM layers (all of them unidirectional). The attention module, a feed-forward network, sits in the middle, between the encoder and decoder. The encoder transforms an input string into an array of vectors. Given the array of vectors, the decoder generates symbols until the end, marked by a special symbol called end-of-sentence (EOS). The attention module lets the decoder pay attention to various regions of the input string during the decoding operation.
Residual Connections: We observe that residual connections between two layers help optimize the gradient flow. We know that deep stacked LSTM networks can enhance accuracy compared with shallower networks. Nonetheless, simply stacking layers is effective only up to a certain depth; beyond that, the network becomes very slow and difficult to train, especially due to the vanishing and exploding gradient problems. The authors validate that the plain stacked LSTM network works well up to 4 layers, hardly with 6 layers, and very badly beyond 8 layers. Most importantly, to address this issue, the GNMT architecture adopts stacked LSTM layers with residual connections. Note that residual connections between consecutive LSTM layers help optimize the gradient flow in the backward pass, which lets us train very deep stacked encoder-decoder networks.
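The effect of residual connections can be illustrated with a toy stack, where a simple tanh transform stands in for each LSTM layer; this is a structural sketch of the skip connections only, with our own illustrative weights, not a reproduction of GNMT's actual layers:

```python
import numpy as np

rng = np.random.default_rng(4)
H, DEPTH = 8, 8                         # hidden size and number of stacked layers

# One toy recurrent layer per depth; a tanh transform stands in for an LSTM.
Ws = [rng.normal(0, 0.1, (H, H)) for _ in range(DEPTH)]

def layer(i, x):
    return np.tanh(Ws[i] @ x)

def stacked(x):
    """Plain stacking: each layer consumes only the output of the layer below."""
    for i in range(DEPTH):
        x = layer(i, x)
    return x

def stacked_residual(x):
    """Residual stacking (as in GNMT from layer 3 upward): a layer's input is
    added to its output before feeding the next layer, giving the gradient a
    direct path through the stack in the backward pass."""
    for i in range(DEPTH):
        out = layer(i, x)
        x = out + x if i >= 2 else out  # residual connections start at layer 3
    return x

x0 = rng.normal(size=H)
y_plain = stacked(x0)
y_res = stacked_residual(x0)
```

In the plain stack, the signal (and, symmetrically, the gradient) must pass through every non-linearity; the additive skip path lets information bypass layers, which is why the residual variant remains trainable at depths where the plain stack degrades.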
Model Parallelism: To speed up the training of the complex GNMT system, both model parallelism and data parallelism are adopted. In data parallelism, n model replicas are trained concurrently. Model parallelism helps optimize the computing speed on each replica: the encoder and decoder modules are separated along the depth axis and laid out on different GPUs, which effectively lets each layer run on a distinct GPU.
Segmentation Approaches: We find that most NMT systems lack robustness in translating rare or out-of-vocabulary (OOV) words. To address this issue, the authors use the "Wordpiece Model (WPM)". Note that the WPM approach was originally developed at Google to address a Japanese-Korean segmentation issue in speech recognition, and a similar approach is used by Sennrich et al. (2015) to solve the OOV or rare-word issue in NMT systems.
Decoder: As in the other models, GNMT uses beam search in decoding. Notably, it is more advanced than in the other models: two refinements, a coverage penalty and length normalization, are introduced into the beam search algorithm. Length normalization helps ensure fair scoring for longer sentences, avoiding favoritism toward short sentences. On the other hand, the coverage penalty supports translations that entirely cover the input.
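The two beam-search refinements can be sketched with the scoring formulas reported by Wu et al. (2016): a length penalty lp(Y) = (5 + |Y|)^α / (5 + 1)^α dividing the hypothesis log-probability, plus a coverage penalty summed over source positions. The constants below (α = 0.6, β = 0.2) are illustrative choices within the ranges the paper discusses, and the attention matrices are made up for the example:

```python
import numpy as np

def length_penalty(length, alpha=0.6):
    """lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha, as reported in Wu et al. (2016)."""
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def coverage_penalty(attn, beta=0.2):
    """cp(X; Y) = beta * sum_i log(min(sum_j p_{j,i}, 1.0)), where attn[j, i]
    is the attention weight of target word j on source word i."""
    coverage = np.minimum(attn.sum(axis=0), 1.0)
    return beta * np.log(coverage).sum()

def rescore(log_prob, length, attn):
    """s(Y, X) = log P(Y|X) / lp(Y) + cp(X; Y)."""
    return log_prob / length_penalty(length) + coverage_penalty(attn)

# A hypothesis whose attention fully covers the source incurs no penalty ...
full = np.full((4, 4), 0.25)                # rows: target words, cols: source words
# ... while one that mostly ignores a source word is pushed down the beam.
partial = np.full((4, 4), 0.33)
partial[:, 3] = 0.01                        # source word 3 barely attended to

s_full = rescore(-4.0, 4, full)
s_partial = rescore(-4.0, 4, partial)
```

Dividing by lp(Y) stops short hypotheses from winning merely because they accumulate fewer log-probability terms, while cp(X;Y) penalizes hypotheses whose attention leaves part of the source untranslated.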


Fig. 32  The architecture of GNMT, Wu et al. (2016). The encoder network is a deep stack of 8 LSTM layers; the first is bidirectional, and there are residual connections (starting from layer 3) between the outputs of consecutive layers. The decoder module is also a deep stack of 8 LSTM layers (all unidirectional). The attention module, a feed-forward network, sits in the middle, between the encoder and decoder

Datasets and Experimental Analysis: GNMT's evaluation is performed on the English-German and English-French translation task datasets of the WMT’14 (ACL 2022b) workshop. The GNMT team also carried out thorough experiments on Google's internal production datasets, with the results shown in Table 12. We observe a comparative analysis among the production PBMT system used by Google (Google SMT), the GNMT system (Google NMT), and evaluations by humans (Human Evaluation). The scores (mean rated scores) are presented for the pairs English-Spanish, English-French, and English-Chinese. From the results, we can say that the GNMT model improves translation accuracy (quality) by more than 60% compared with the PBMT (Google SMT) system on these widely spoken language pairs. Also note that the researchers and developers of GNMT (Wu et al. 2016) show a comparative analysis with many other models, such as RNN Encoder-Decoder models and RNN Encoder-Decoder with attention mechanism based models (e.g., Bahdanau's model, Bahdanau et al. 2014, and Luong's model, Luong et al. 2015), and validate that GNMT outperforms all of them.
Evolution (Recent Advances) in GNMT: We observe that GNMT is not only improving
in translation accuracy; over the past years it has also grown to cover translations across
more than 100 languages, as shown in Fig. 33. Notably, every year new language pairs are
added to GNMT and the translation quality of the existing pairs is further optimized.
Summary:
GNMT is a SEQ2SEQ framework stacked with eight LSTMs in each of the encoder and
decoder, with the attention mechanism in the middle layer connecting the two. Notably, in
the encoder the first LSTM layer is bidirectional, and from layer 3 onward residual
connections exist between consecutive layers. On the other hand, all the LSTMs in the
decoder are unidirectional. The experimental analysis validates that Google NMT
outperforms Google SMT by a large margin.


Table 12  Mean rated scores on Google’s internal production data, Wu et al. (2016)

                      PBMT    GNMT    Human   Relative Improvement
English → Spanish     4.885   5.428   5.504   87%
English → French      4.932   5.295   5.496   64%
English → Chinese     4.035   4.594   4.987   58%
Spanish → English     4.872   5.187   5.372   63%
French → English      5.046   5.343   5.404   83%
Chinese → English     3.694   4.263   4.636   60%

The GNMT model improves translation quality by more than 60% compared with the
PBMT (Google SMT) system on the widely spoken language pairs English-Spanish,
English-French, and English-Chinese

Fig. 33  Evolution (recent advances) in GNMT, reproduced from the information in Caswell and Liang
(2022)

3.3 Transformer in NMT

Transformer, introduced by Vaswani et al. (2017), is a simple network architecture built
solely on top of attention mechanisms. As stated earlier, attention mechanisms became
an integral part of machine translation and were used in conjunction with RNNs, such as
LSTMs and GRUs, to address the issues discussed above. However, LSTMs and GRUs
are not computationally efficient for large datasets and carry a number of problems of
their own. As such, Vaswani et al. (2017) proposed a new model, the Transformer, which
renounces recurrence and instead relies entirely on self-attention to compute the global
dependencies between source (input) and target (output). In addition,


Fig. 34  The architecture of the basic Transformer, reproduced from the information in
Vaswani et al. (2017)

the Transformer enables better parallel processing of tasks and achieves better translation
quality (Rothman 2021; Vaswani et al. 2017). It is a major breakthrough in the expansion
of deep learning. We now explore the Transformer thoroughly and observe how it advances
deep learning and, in particular for this paper, machine translation.

3.3.1 Vaswani’s transformer model

The overall architecture of a Transformer proposed by Vaswani et al. (2017) is a stack of
self-attention and point-wise, fully connected layers organized into an encoder module and
a decoder module (shown in Fig. 34). The encoder is on the left and the decoder on the
right, each stacked with 6 layers as in the original Transformer of Vaswani et al. (2017).
However, some current Transformer variants use different numbers of layers or no decoder
at all. Notably, no RNN, LSTM, or CNN is integrated for recurrence; recurrence has been
replaced by attention (Rothman 2021; Vaswani et al. 2017).
Encoder: The encoder is a stack of 6 (identical) layers, each of which is divided into
two constituent sub-layers: a multi-head self-attention mechanism and a position-wise fully

Table 13  The Transformer outperforms other prominent models by a large margin on English-German and
English-French translation (comparative analysis with BLEU scores), Vaswani et al. (2017)

Model                                           BLEU (EN-DE)   BLEU (EN-FR)
ByteNet (Kalchbrenner et al. 2016)              23.75          -
Deep-Att + PosUnk (Zhou et al. 2016)            -              39.2
GNMT + RL (Wu et al. 2016)                      24.6           39.92
ConvS2S (Gehring et al. 2017)                   25.16          40.46
MoE (Shazeer et al. 2017)                       26.03          40.56
Deep-Att + PosUnk Ensemble (Zhou et al. 2016)   -              40.4
GNMT + RL Ensemble (Wu et al. 2016)             26.30          41.16
ConvS2S Ensemble (Gehring et al. 2017)          26.36          41.29
Transformer (base model)                        27.3           38.1
Transformer (big)                               28.4           41.0

However, it is difficult to determine an outright winner: for some language pairs the Transformer performs
better and for others GNMT performs better


Fig. 35  A depiction of T5 model, a multi-task learning (sharing) text-to-text framework. All NLP tasks
(having “a common template:...”, e.g., “translate English to German:...”) are framed into this framework
where all the tasks follow the same model, same hyperparameters, and same loss function instead of task-
specific model or task-specific loss functions as in the classical paradigm (Raffel et al. 2020)

connected feed-forward network. A residual connection is added around each sub-layer,
followed by layer normalization. The residual connection carries the raw input forward to
the normalization function, which preserves key information (e.g., the positional encoding)
so that it is not lost in transmission (Rothman 2021; Vaswani et al. 2017).
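The residual-plus-normalization wrapper around each sub-layer can be sketched in a few lines of NumPy (post-norm, as in the original paper); here `fn` is a stand-in for either the attention or the feed-forward sub-layer, not a trained component.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    """Wrapper used around every Transformer sub-layer:
    LayerNorm(x + Sublayer(x)). The residual term carries the raw input
    (e.g., positional encoding) forward alongside the transformation."""
    return layer_norm(x + fn(x))

x = np.random.randn(4, 8)                 # 4 positions, model dimension 8
out = sublayer(x, lambda h: np.tanh(h))   # stand-in for attention/FFN
print(out.shape)                          # (4, 8)
```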
Decoder: The decoder is also a stack of 6 (identical) layers like the encoder, each
consisting of three main sub-layers: a masked multi-head attention mechanism, a multi-
head attention mechanism, and a position-wise fully connected feed-forward network.
Specifically, future positions are masked in the masked multi-head attention sub-layer;
this forces the Transformer to predict the next token in the sequence without seeing the
tokens that follow. As in the encoder, a residual connection and layer normalization are
added around each main sub-layer (Rothman 2021; Vaswani et al. 2017).
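The masking can be illustrated with a lower-triangular (causal) mask. This sketch only builds the boolean mask; in the actual model, the disallowed positions are set to a large negative value in the attention scores before the softmax.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend only to positions <= i,
    so the decoder cannot look at future tokens during training."""
    return np.tril(np.ones((n, n), dtype=bool))

print(causal_mask(3).astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
```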
Attention: The attention function is the core of the Transformer model and is inspired
by attention in the human brain. The attention mechanism is formulated as a mapping of
queries, keys, and values (Mueller 2022; Rothman 2021; Vaswani et al. 2017).

– Key (K): a word label that helps distinguish words.
– Query (Q): helps find the best-matching key.
– Value (V): as usual, a value carries the information of a label (i.e., a word). A key and
a value form a pair and propagate together.

The attention mechanism relates each word to all the words in a sequence, including
itself. The basic attention integrated into the Transformer is scaled dot-product
attention. To enhance performance, multi-head attention is introduced: as the
name suggests, it runs multiple scaled dot-product attention mechanisms in parallel,
processing multiple representations at the same time. The Transformer uses it both as
cross-attention (also known as encoder-decoder attention) (Gheini
et al. 2021a, b) and as self-attention (Rothman 2021; Vaswani et al. 2017).
As a cross-attention layer, it combines two different embedding sequences, one from the
encoder and one from the decoder. In particular, it combines queries from the decoder
with (key, value) pairs from the encoder and thereby helps the decoder predict the
subsequent token(s) of a translated sequence. In sum, it emulates the usual
encoder-decoder attention mechanism of the sequence-to-sequence models discussed
earlier (Bahdanau et al. 2014; Wu et al. 2016). A self-attention layer, on the
other hand, takes a single embedding sequence as input. A machine needs to master
transduction from scratch, and the self-attention mechanism boosts its analytical
capability (Rothman 2021; Vaswani et al. 2017) thanks to its favorable computational
complexity and aptitude for parallel computing: per layer, self-attention operates faster
than a recurrent network and processes all positions in parallel rather than
sequentially.
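The two building blocks can be sketched in NumPy as follows. The learned projection matrices (W_Q, W_K, W_V, W_O) are omitted for brevity, so this is an illustration of the computation pattern rather than a trainable layer.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block disallowed positions
    # Numerically stable softmax over the last axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads=2):
    """Self-attention with several heads running in parallel: the model
    dimension is split across heads, each head attends independently,
    and the results are concatenated."""
    n, d = X.shape
    head_dim = d // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        heads.append(scaled_dot_product_attention(X[:, sl], X[:, sl], X[:, sl]))
    return np.concatenate(heads, axis=-1)

X = np.random.randn(5, 8)        # 5 tokens, model dimension 8
out = multi_head_attention(X)
print(out.shape)                 # (5, 8)
```

Because every head works on a different slice independently, the heads can run in parallel, which is exactly the parallelism advantage over a recurrent network described above.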

Datasets and Experimental Analysis: Vaswani et al. (2017) evaluate their approach on
the English-German and English-French translation task datasets of the WMT’14 (ACL
2022b) workshop. The Transformer achieves a state-of-the-art BLEU score, as shown in
Table 13. However, it is difficult to declare an outright winner: for some language pairs the
Transformer performs better and for others GNMT performs better. Notably, to find the
best translation in decoding, the beam search method was used, as in the other methods
noted in this paper.
Summary: The overall architecture of a Transformer is a stack of self-attention and
point-wise, fully connected layers organized into an encoder module and a decoder
module. As stated earlier, the Transformer reduces computational complexity, enables
better parallel processing of tasks, and improves translation quality.
Note that the base Transformer model has led to the development of many pre-trained
transformers, and these pre-trained models are now extensively used in different NLP
tasks with tremendous success. Notably, these models are pre-trained on large datasets;
in practice, they also work well with small datasets and can be fine-tuned to our needs.
Some of the most popular and commonly used transformers are Bidirectional Encoder
Representations from Transformers (BERT) and the Robustly Optimized BERT Pretraining
Approach (RoBERTa), among many others. A comprehensive list can be found on Hugging
Face.17

3.3.2 Google’s T5 model

Raffel et al. (2020) introduce the “Text-To-Text Transfer Transformer”, in short
the “T5” model, a pre-trained model trained on a large text corpus. In the study (Raffel
et al. 2020), the authors demonstrate how the T5 model achieves state-of-the-art results on
multiple NLP tasks.
An important ingredient in transfer learning toward optimal outcomes is the unlabeled
data used for pre-training. Such a dataset should be of high quality, diverse in nature, and
massive in size. Existing pre-training datasets did not meet these requirements, so the
authors introduce a new open-source pre-training dataset, called the “Colossal Clean
Crawled Corpus” (C4),18 which is available under TensorFlow Datasets. The T5 model,
pre-trained on the C4 dataset, achieves exemplary results on many NLP tasks.

17
Hugging Face Transformers Index https://huggingface.co/docs/transformers/index.
18
C4 Dataset https://www.tensorflow.org/datasets/catalog/c4.


A Multi-task Learning (Sharing) Text-To-Text Unified Framework: With T5, Raffel et al.
(2020) propose a multi-task learning (sharing) text-to-text framework, i.e., a unified text-to-
text framework where the input (source) and output (target) are consistently texts. Notably,
all NLP tasks (e.g., document summarization, machine translation, question answering,
classification tasks, and others) are framed into this framework, where all the tasks follow
the same model, the same hyperparameters, and the same loss function instead of
task-specific models or task-specific loss functions as in the classical paradigm, as shown
in Fig. 35. This unified paradigm makes it state-of-the-art in functioning and boosting
operational activities.
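The unified framing can be illustrated with a small sketch: every task is serialized to a plain string carrying a task prefix, so a single model signature serves them all. The helper function is hypothetical; the prefixes shown follow the convention described by Raffel et al. (2020).

```python
def make_t5_input(task_prefix, text):
    """In the text-to-text framework every NLP task is serialized to a
    plain string: a task prefix followed by the input text. One model,
    one loss, and one set of hyperparameters then handle all tasks."""
    return f"{task_prefix}: {text}"

# The same interface frames translation, summarization, and
# classification-style tasks alike.
print(make_t5_input("translate English to German", "That is good."))
print(make_t5_input("summarize", "state authorities dispatched emergency crews ..."))
print(make_t5_input("cola sentence", "The course is jumping well."))
```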
Architecture: We observe that the T5 model is basically an encoder-decoder
Transformer, like the base Transformer model introduced by Vaswani et al. (2017),
extended with some modular changes. With T5, three architecture variants are compared:
encoder-decoder, decoder-only language models, and decoder-only prefix language models.
Notably, the encoder-decoder model performs best across all downstream (fine-tuning)
tasks.
Datasets and Experimental Analysis: T5 achieves state-of-the-art results in all tasks
except machine translation. Since our study focuses on machine translation, we
demonstrate only the experimental analysis of the machine translation task for the T5
model. The authors assume that the performance in MT lags because the model is trained
on an English-only unlabeled dataset. Another point is that most of the best MT methods
use round-trip translation or back-translation; as we know, round-trip translation helps
augment the dataset and ensures better translation quality (as stated later in this paper).
The authors also infer that, compared with the best methods, the scaling and pre-training
of T5 may not be sufficient to achieve better performance, and they speculate that training
on larger datasets gives the more sophisticated methods their better outcomes.
Notably, for the translation task, the authors experiment on the English-German,
English-French, and English-Romanian translation task datasets of the WMT workshop.19
To find the best translation in decoding, they use beam search, as in the other methods
noted in this paper.
Summary: The T5 model is an encoder-decoder Transformer based on the basic
Transformer introduced by Vaswani et al. (2017), extended with some modular changes
and pre-trained on the large C4 dataset. With T5, we can perform all the downstream
tasks optimally compared with other models, achieve viable results, and gradually
approach human-level performance.

3.4 RNN Enc‑Dec vs. RNN Enc‑Dec with attention vs. Transformer

In this section, we present a comparative analysis of the NMT models RNN Encoder-
Decoder, RNN Encoder-Decoder with Attention, and Transformer.
With the RNN Encoder-Decoder model (Sect. 3.1), we demonstrate Cho’s model (Cho
et al. 2014a) (Sect. 3.1.1) and Sutskever’s model (Sutskever et al. 2014) (Sect. 3.1.2). We
observe that both Cho’s model and Sutskever’s model achieve better BLEU scores com-
pared with the baseline PBMT (Google SMT) model (shown in Tables 8 and 9). How-
ever, both models encounter issues as the sequence length and the number of unknown
words increase (Cho et al. 2014b), and the forgetting issue also arises in the decoder (Cho et al.

19
WMT Workshop https://www.statmt.org/.


Table 14  A comparative analysis of the T5 model on the machine translation task, Raffel et al. (2020)

Model           BLEU (WMT EnDe)   BLEU (WMT EnFr)   BLEU (WMT EnRo)
T5-11B          32.1              43.4              28.1
T5-3B           31.8              42.6              28.2
T5-Large        32.0              41.5              28.1
T5-Base         30.9              41.2              28.0
T5-Small        26.7              36.0              26.8
Previous best   33.8 e            43.8 e            38.5 f

We observe that T5 fails to achieve a comparable outcome on the machine translation task. The authors
assume that the performance in MT lags because the model is trained on an English-only unlabeled dataset.
Another point is that most of the best methods (e.g., e Edunov et al. 2018; f Lample and Conneau 2019) of
the MT task use round-trip translation or back-translation. The authors also infer that, compared with the
best methods, the scaling and pre-training of T5 may not be sufficient to achieve better performance, and
they speculate that training on larger datasets gives the more sophisticated methods their better outcomes

2014b; Karim 2019). Notably, the pros and cons (summary) of these two models are dem-
onstrated at the end of the sections 3.1.1 and 3.1.2.
To address the aforementioned issues, the RNN Encoder-Decoder with attention mecha-
nism was introduced (Bahdanau et al. 2014; Luong et al. 2015). Under the RNN Encoder-
Decoder with attention mechanism (Sect. 3.2), we demonstrate Bahdanau’s model (Bah-
danau et al. 2014) (Sect. 3.2.1), Luong’s model (Luong et al. 2015) (Sect. 3.2.2), and
Google’s NMT (GNMT) (Wu et al. 2016) (Sect. 3.2.3). We observe that all these models
outperform the RNN Encoder-Decoder model (shown in Fig. 29 and Tables 10, 11, and
12). Among these three models, GNMT outperforms the other two (Wu et al. 2016).
Notably, the pros and cons (summary) of these models are demonstrated at the end of
the sections 3.2.1, 3.2.2, and 3.2.3, and in Sect. 3.2.3 we also demonstrate the evolution
(recent advances) in GNMT. As we know, RNN models suffer from numerous issues, such
as computational inefficiency, limited support for parallel computing, and many others.
To address the aforementioned issues, the Transformer with self-attention mechanism
was introduced (Raffel et al. 2020; Vaswani et al. 2017) (Sect. 3.3). Among Transformer
based models, we demonstrate Vaswani’s Transformer model (Vaswani et al. 2017)
(Sect. 3.3.1) and Google’s T5 model (Raffel et al. 2020) (Sect. 3.3.2). We observe that
Vaswani’s Transformer achieves better translation quality. However, the Transformer
based model and GNMT run at a similar pace, each ahead on different pairs of languages
(shown in Table 13). On the other hand, Google’s T5 model lags behind the best MT
models (shown in Table 14). Notably, the pros and cons (summary) of these models are
demonstrated at the end of the sections 3.3.1 and 3.3.2.
In sum, we can say that the Transformer with self-attention, introduced by Vaswani et al.
(2017), and GNMT by Google (Wu et al. 2016) perform better than other approaches.
As we know, Google NMT (GNMT) is used as the mainstream MT system or MTE for
Google Translate. The good news is that researchers are continuously working
on NMT models to optimize translation quality and add translations for new languages.
In the following section, we demonstrate some of the prominent methods and recent
advances in MT augmentation.


4 Augmentation of machine translation

We find that it is difficult to achieve high translation accuracy with low-resource bilin-
gual datasets or small-size bilingual corpora. We observe that prominent methods built
on top of the baseline NMT or SMT models can help address this issue and conse-
quently optimize translation quality significantly. In particular, we find that round-trip
translation (Ahmadnia and Dorr 2019; Somers 2005) and pivot (bridging) languages
(Liu et al. 2018; Wu and Wang 2007) can help us elegantly in this case. Also, knowl-
edge graphs (Ahmadnia et al. 2020; Bollacker et al. 2008; Lu et al. 2018; Moussallem
et al. 2019; Zhao et al. 2020a) can broadly augment machine translation. Apart from
these, various recent mechanisms advance machine translation in many different ways.

4.1 Round‑trip translation having low‑resource bilingual dataset

To address the scarcity or insufficiency of large bilingual datasets, He et al. (2016) intro-
duce an autoencoder-like mechanism named “Dual learning” to utilize monolingual data-
sets. The mechanism shows how an NMT model can automatically learn from unlabeled
data and perform round-trip translation to validate it. In particular, in this mechanism,
one agent (model) performs the source-to-target translation and another agent (model)
translates back from the target language to the source language. The reconstruction error
is then analyzed, and this dual-learning game continues until the reconstruction error
becomes negligible, i.e., until the two models converge. Having two models for round-trip
translation toward validation, the authors name it “dual-NMT”. Notably, the experimental
results on English-French translation show that Dual learning is effective with small
datasets: with a small share of the data it outperforms the baseline models, such as the
Bahdanau et al. model (Bahdanau et al. 2014) and the Sennrich et al. model (Sennrich
et al. 2015). In particular, the results validate that Dual learning with 10% of the data can
outperform baseline models trained on 100% of the data. Therefore, we can infer that the
model is very effective for small datasets.
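The round-trip validation loop can be sketched as follows. The dictionary-based “models” and the word-overlap score below are toy stand-ins for the two NMT agents and the reconstruction error; they only illustrate the control flow of the dual-learning check.

```python
def round_trip_score(sentence, forward, backward):
    """Dual-learning-style check: translate source -> target with one
    model and back with the other; the reconstruction quality signals
    how well the pair of models agrees. `forward` and `backward` are
    hypothetical stand-ins for the two NMT agents."""
    reconstruction = backward(forward(sentence))
    src, rec = set(sentence.split()), set(reconstruction.split())
    return len(src & rec) / max(len(src), 1)   # 1.0 = perfect round trip

# Toy dictionary "models" for illustration only.
en2fr = {"the": "le", "cat": "chat"}
fr2en = {"le": "the", "chat": "cat"}
fwd = lambda s: " ".join(en2fr.get(w, w) for w in s.split())
bwd = lambda s: " ".join(fr2en.get(w, w) for w in s.split())
print(round_trip_score("the cat", fwd, bwd))   # 1.0
```

In actual dual learning, the reconstruction error would be a training signal driving both models, rather than a one-off diagnostic.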
Also addressing the bottleneck of low-resource bilingual training datasets, Ahmadnia and
Dorr (2019) present a round-trip training approach that improves translation accuracy. In
general, the model incorporates dual learning (He et al. 2016) to learn from unlabeled text
and then transcends this learning through effective leveraging of monolingual data. In
particular, the model synchronously performs self-training with a small labeled dataset
and improves through co-training while incorporating the unlabeled dataset into the
training data. The experimental results, with English-Spanish as a high-resource language
(HRL) pair and Persian-Spanish as a low-resource language (LRL) pair, show that the
model outperforms the baseline models (e.g., the Bahdanau et al. model, Bahdanau et al.
2014).

4.2 Pivot language technique having small size bilingual corpora

To address the issue of small-size bilingual corpora, we observe that a pivot language can
help us in a graceful way. Ahmadnia et al. (2017) present how the pivot language
technique can help optimize the accuracy of low-resource statistical machine translation,
e.g., Persian–Spanish. In particular, the authors use English as the pivot (bridging)
language, joining a Persian–English SMT system with an English–Spanish SMT system.
In this model, the large English–Spanish corpora are used in the pairing of the
Persian–Spanish translation. The experimental results show that the pivot/bridging
language technique outperforms the direct Persian-Spanish SMT models. Besides, the
authors show that in Persian–Spanish SMT models, the phrase-level pivot strategy
outperforms the sentence-level one. Moreover, the authors analyze a model combining the
standard direct model with the best triangulation pivoting model and infer that it achieves
better translation quality.
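At its core, the pivot idea is a composition of two translation systems, as the sketch below shows; the dictionary-based translators are toy stand-ins for the Persian-English and English-Spanish systems.

```python
def pivot_translate(sentence, src_to_pivot, pivot_to_tgt):
    """Pivot (bridging) translation: when no large source-target corpus
    exists, chain a source->pivot system with a pivot->target system.
    Both arguments are hypothetical translation callables."""
    return pivot_to_tgt(src_to_pivot(sentence))

# Toy stand-ins: Persian -> English -> Spanish via word dictionaries.
fa2en = {"salam": "hello"}
en2es = {"hello": "hola"}
fa_en = lambda s: " ".join(fa2en.get(w, w) for w in s.split())
en_es = lambda s: " ".join(en2es.get(w, w) for w in s.split())
print(pivot_translate("salam", fa_en, en_es))   # hola
```

Real pivoting (e.g., phrase-level triangulation in Moses) operates on phrase tables rather than whole sentences, but the bridging structure is the same.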
Besides, Dabre et al. (2015) show how the limitations of small-size corpora can be
mitigated using multiple pivot languages. In particular, they present how multiple
pivot languages can help boost translation with small-size multilingual parallel corpora
for Japanese-Hindi SMT models. The model keeps the same source and target corpus,
but extracts phrase pairs from them using a couple of pivot languages, leading to better
translation accuracy. With this method, they use the Multiple Decoding Paths (MDP)
feature of Moses, which empirically verifies the various methods used in their analysis
and helps find the best one. In general, their analysis shows that pivot languages can
address the limitations of small-size corpora. In particular, they show that pivot/bridging
languages help learn new phrase pairs; notably, when the direct source-target corpus is
small, these extra phrase pairs are not learned. With experimental results, the authors
show that multiple pivot languages improve BLEU by up to 3 points for Japanese-Hindi
translation compared with a single pivot. The authors simultaneously use 7 pivot
languages for decoding and achieve a substantial improvement in translation quality.
Above all, we observe that round-trip translation and pivot languages help optimize
translation quality for both NMT and SMT when only a low-resource bilingual dataset or
a small-size bilingual corpus is available (Ahmadnia et al. 2019; Cheng 2019; Currey and
Heafield 2019; Dabre et al. 2021; Moon et al. 2020; Guo et al. 2021; Ko et al. 2021;
Imankulova et al. 2019; Ranathunga et al. 2021; Zhou et al. 2019).

4.3 Knowledge graph in NMT enhancement

As we know, NMT shows weak robustness in translating unknown, rare, or OOV words.
Specifically, it is challenging to translate entities (i.e., proper nouns and common nouns)
and terminological expressions. We also observe that the parallel sentence relationships of
bilingual corpora cannot explicitly exploit word-level relationships for semantic relations
and word-sense disambiguation. Consequently, these pose myriad challenges to NMT
enhancement. Principally, semantic relations and word-sense disambiguation are
discovered and maintained only during the training step, and this process is not well
supported by current NMT schemes; moreover, training data alone cannot ensure optimal
translation quality. Generally, an enriched vocabulary and the maintenance of semantic
relations and word-sense disambiguation help yield better translations. The authors
Ahmadnia et al. (2020), Moussallem et al. (2019), and Zhao et al. (2020a, 2020b) show
how we can leverage Knowledge Graphs (KGs) (Auer et al. 2007; Bollacker et al. 2008;
Chah 2018; Färber et al. 2015; Lu et al. 2018; Rebele et al. 2016; Ringler and Paulheim
2017) toward extracting semantic features, retaining and enhancing semantic relations,
enriching vocabulary, and adding other viable functionalities to achieve better translations.
Ahmadnia et al. (2020) and Moussallem et al. (2019) show the usage of entities and
relations as embedding constraints in existing KGs toward optimizing the mapping from
source to target language translations for NMT. The embedding constraints are monolin-
gual and bilingual. As we know, KG knowledge is represented by triples, where each
triple consists of a head/subject, generally an entity (e.g., Joe Biden); a relation, generally
a property (e.g., president); and a tail/object, generally an entity, a terminology, or a
literal (e.g., America). The authors assign a unique identity to entities and associate tri-
ples by making use of Entity Linking (EL). In particular, with the monolingual embedding
constraint, the semantic relations of source words are extracted, retained, and enhanced
through accessing the entity relations in a KG. With the bilingual embedding constraint,
the source-language entity relations are transferred to and maintained in the
corresponding translations. In short, this scheme lets us leverage external knowledge in
the translation process and helps extract, retain, and enhance semantic relations between
source and target translations (Ahmadnia et al. 2020; Lu et al. 2018; Moussallem et al.
2019). In this way, it decreases the number of OOV/unknown words and optimizes the
translation quality. Notably, Ahmadnia et al. (2020) evaluate and validate their method
on the WMT English-Spanish translation task datasets using the Freebase KG
(Bollacker et al. 2008; Chah 2018). On the other hand, Moussallem et al. (2019) perform
their analysis on the WMT English-German translation task datasets, exploiting the
DBpedia KG (Auer et al. 2007) to validate their hypothesis; notably, they carried out an
extensive manual and automatic analysis to this end.
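A toy sketch of the idea: tokens linked to KG entities are translated through an entity table rather than falling back to `<unk>`. The triple set, the entity-translation table, and linking by exact string match are all illustrative simplifications of actual entity linking.

```python
def link_entities(sentence, kg_triples, entity_translations):
    """Sketch of KG-assisted entity handling in translation.

    kg_triples         : set of (head, relation, tail) triples, e.g.
                         ("Joe_Biden", "president", "America")
    entity_translations: hypothetical entity -> target-language surface form
    """
    # Entities are the heads and tails of the KG triples.
    entities = {h for h, _, _ in kg_triples} | {t for _, _, t in kg_triples}
    out = []
    for token in sentence.split():
        if token in entities and token in entity_translations:
            out.append(entity_translations[token])   # entity handled via KG
        else:
            out.append(token)                        # left to the NMT model
    return " ".join(out)

triples = {("Joe_Biden", "president", "America")}
ent_tr = {"America": "Amerika"}   # illustrative English->German entry
print(link_entities("Joe visits America", triples, ent_tr))
# Joe visits Amerika
```

In the cited systems, this knowledge is injected as embedding constraints during training rather than by post-hoc substitution, but the goal is the same: entities known to the KG no longer surface as OOV words.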
A shortcoming of the studies of Ahmadnia et al. (2020) and Moussallem et al. (2019)
is that they include only the entities that appear both in the KGs and in the parallel train-
ing sentence pairs. KGs, however, carry many entities that do not occur in the train-
ing sentence pairs, and this seriously limits translation quality. To address this issue,
Zhao et al. (2020a) propose a method that exploits the target KG and the non-parallel
source text together. They evaluate and validate their method on Chinese-to-English transla-
tion tasks using the YAGO KG and on English-to-Japanese translation tasks using the
DBpedia KG (Zhao et al. 2020a).

4.4 Recent advances in MT (esp. NMT)

Robust NMT (RoNMT), Cheng et al. (2019): We observe that NMT models are often
vulnerable to noisy, perturbed input. To this end, Cheng et al. (2019) introduce a method to
improve the robustness of models. Specifically, their approach has two parts: (1) attacking
the NMT translation model with adversarial data, and (2) defending the model with
adversarial inputs to improve its robustness against adversarial sources. Their experiments
validate that the approach improves robustness on noisy input.
Adversarial Augmentation for NMT (AdvAug), Cheng et al. (2020): We observe that vici-
nal risk (Chapelle et al. 2000; Ravikumar et al. 2020) degrades translation accuracy.
To minimize the vicinal risk, Cheng et al. (2020) introduce an adversarial augmentation
approach, AdvAug, for NMT over virtual sentences. Notably, this approach augments
the dataset without adding new external data while minimizing the vicinal risk. Thereby,
the authors perform data augmentation while improving translation quality.
Augmenting Back-Translation (BT) with Hints (HintedBT), Ramnath et al. (2021):
We know that back-translation (BT), or round-trip translation, helps data augmentation
for monolingual corpora in NMT, especially for LRL pairs. To boost the efficacy of the
available BT data, Ramnath et al. (2021) introduce a set of techniques, called HintedBT,
that delivers hints (via tags) to the encoder and decoder modules. In particular, with
HintedBT, the authors first introduce a method of using both low- and high-quality BT
data by delivering hints about the quality of each source-target pair to the model. Besides,
with HintedBT, the authors address cross-script translation and transliteration tasks: the
method first determines whether a source string needs to be translated or transliterated
into the target language, and if so, the model is trained with additional hints that boost
the efficacy of back-translation. The experimental analysis shows that translation quality
can be improved by a large margin.
Improving Naturalness of MT (Natural Diet), Freitag et al. (2022): We observe that the
evaluation of MT focuses entirely on accuracy and fluency without paying attention to
naturalness. In other words, even though a target translation may be considered accurate
and fluent, it can lack naturalness compared with high-quality human translations or
original text in the target language; in particular, the translation often carries little lexi-
cal diversity and mimics the structure of the source string. To this end, Freitag et al.
(2022) introduce a method to obtain a likewise natural translation (like original text in
the target language). To achieve this, the proposed method tags parallel training data
according to the naturalness of the target side, instead of training separate models on
translated and natural data. Notably, the tagging imposes greater emphasis on target
strings originally written in the target language. Their experimental analysis, as well as
automatic evaluation, shows that the proposed method achieves higher lexical diversity,
resembling human translations.
Multilingual Mix, Bapna et al. (2022): To improve multilingual NMT models, Bapna
et al. (2022) introduce a multilingual crossover approach that promotes sharing inputs
and outputs across languages. To improve the fusion of samples in multilingual models,
the authors propose a set of techniques to optimize interpolation across differing
languages under extreme data imbalance. The experimental analysis shows that the
approach achieves higher translation quality on English-to-ManyLang, ManyLang-to-
English, and zero-shot MT tasks.

5 SMT vs. NMT

In this paper, we work mainly with NMT and SMT. As such, in this section, we show a
comparative analysis between SMT and NMT. First, we show a theoretical comparison.
Thereafter, we comparatively analyze their performance.

5.1 Theoretical comparison

SMT is built on a Bayesian framework (Brown et al. 1993), which adopts the noisy channel model and regards machine translation as a probability estimation problem (Bentivogli et al. 2016; Koehn et al. 2003). Like SMT, NMT also derives translations by estimating probabilities, but it differs from SMT in its implementation. In SMT, the model can be separated into several sub-modules, for example, the language model, the translation model, and the reordering model. These components work together to implement the translation function. On the other hand, NMT adopts a neural network to translate from the source language to the target language directly, and the network is also equipped with alignment or reordering functions similar to SMT. NMT with an attention mechanism dynamically obtains the source language information that is correlated with the word currently being generated.


Table 15  NMT versus SMT (Besacier and Blanchon 2017)

Parameter                         NMT              SMT
Unit/Core Element                 Word vectors     Words or phrases
Knowledge                         Learned weights  Phrase table
Representation                    Continuous       Discrete
Model                             Non-linear       Linear
Model Size                        Smaller          Large
Training Time                     Long             Shorter
Training Pipeline                 Elegant          Complex
Interpretability                  Low              Medium
Introducing Linguistic Knowledge  Doable           Doable

Therefore, NMT can derive the corresponding alignment information without setting up an
alignment model as in SMT (Bentivogli et al. 2016).
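This contrast can be stated compactly. Under the noisy-channel view, SMT decomposes the posterior with Bayes' rule into a translation model and a language model, whereas NMT parameterizes the posterior directly with a single network; this is a standard formulation consistent with Brown et al. (1993), with f denoting the source sentence and e the target sentence:

```latex
% SMT (noisy channel): translation model P(f|e) times language model P(e)
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

% NMT: one network estimates the posterior directly, token by token
P(e \mid f; \theta) = \prod_{t=1}^{T} P(e_t \mid e_{<t}, f; \theta)
```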
In sum, the core elements of NMT are word vectors (Singh et al. 2017), while the core elements of SMT are words or phrases. Besides, the learned parameters in NMT are the model weights, whereas in SMT they are phrase tables. Notably, NMT models are non-linear while SMT models are linear (Besacier and Blanchon 2017). For a quick overview, we summarize this comparative analysis in Table 15.
We would like to note that some of the points in the table are illustrated throughout the paper and some of them are self-explanatory. However, we do not cover the final row, Introducing Linguistic Knowledge, elsewhere in the paper. Therefore, we briefly discuss it in this section.

5.1.1 Introducing linguistic knowledge in NMT

As we know, NMT requires a finite vocabulary. To this end, Casas et al. (2021) introduce a set of viable, linguistically-grounded vocabulary definition schemes that take advantage of both sub-word and word-level vocabularies. Their empirical analysis validates their proposition while improving translation quality.
Notably, MT encounters the issue of gender bias. To mitigate the issue, Kharitonova (2021) introduces an approach adopting a factored Transformer that helps integrate linguistic features and thereby lessen gender bias in translation. Notably, their experimental analysis validates the approach in minimizing gender bias.
Also, we observe that in some languages (especially LRLs) it is difficult to model the morphology, which makes the translation task much harder. To this end, Ortega et al. (2020) propose an NMT system for an LRL, Quechua (widely spoken in South America), experimenting with numerous morphological segmentation approaches. Finally, they bring in a Quechua-Spanish MTE for the community, integrating a brand-new morphological segmentation approach for decomposing morphemes with suffixes.


Table 16  The evaluated MT systems, Bentivogli et al., 2016, reproduced from the information in Bentivogli et al. (2016)

System                        Approach                                                  Data
PBSY (Huck and Birch 2015)    Combination of phrase- and syntax-based string-to-tree;   175 M/3.1B
                              hierarchical and sparse lexicalized reordering models
HPB (Jehl et al. 2015)        Hierarchical phrase-based; source pre-ordering;           166 M/854 M
                              rescoring with neural LM
SPB (Bentivogli et al. 2016)  Standard phrase-based; source pre-ordering;               117 M/2.4B
                              rescoring with neural LMs
NMT (Luong et al. 2015)       LSTM attention-based; source reversing;                   120 M/–
                              rare words handling

Table 17  The TERs and BLEU scores, Bentivogli et al., 2016, reproduced from the information in Bentivogli et al. (2016)

System  BLEU   HTER   mTER
PBSY    25.3   28.0   21.8
HPB     24.6   29.9   23.4
SPB     25.8   29.0   22.7
NMT     31.1∗  21.1∗  16.2∗

Fig. 36  Average mTER of the MT outputs against different sentence lengths, Bentivogli et al., 2016, reproduced from the information in Bentivogli et al. (2016)

5.1.2 Introducing linguistic knowledge in SMT

Nießen and Ney (2000) and Gispert Ramis (2007) show that using morphology and syntactic information helps optimize the translation quality of SMT, validating the proposition with empirical study.

Fig. 37  mTER scores per talk, Bentivogli et al., 2016, reproduced from the information in Bentivogli et al. (2016)

Table 18  Automatic MT evaluation metrics (BLEU, METEOR, WER, TER) scores, Stasimioti et al. (2020). NMT outperforms SMT

System        BLEU  METEOR  WER   TER
SMT           0.34  0.48    0.50  0.52
NMT           0.39  0.52    0.49  0.51
tailored-NMT  0.46  0.56    0.43  0.39


As we know, reordering is a crucial task for SMT. To this end, Zhang et al. (2007) use syntactic knowledge to optimize the global reordering of SMT while compensating for reordering errors arising from the parse tree. Their experimental analysis validates their proposition.

5.2 Performance comparison

We observe that in recent years, extensive experimental analyses have been conducted to compare NMT and SMT, and NMT has been shown to outperform SMT. Therefore, we present a comparative analysis of these translation approaches and explain why NMT is the state-of-the-art approach today. Notably, the performance results of NMT and SMT are reported with different evaluation metrics, such as BLEU, NIST, METEOR, ROUGE, WER, TER, and others, as discussed in this article.

5.2.1 Bentivogli et al. (2016)

In 2015, the results of the International Workshop on Spoken Language Translation (IWSLT) evaluation campaign (IWSLT 2018) showed that NMT outperforms SMT systems on the English-German language pair, based on the automatic transcription and translation of TED talks. Bentivogli et al. (2016) worked on finding the reasons under the hood.
Bentivogli et al. (2016) select one NMT system (Luong's attention model, Luong et al. 2015) and three phrase-based MT (PBMT) systems: a standard phrase-based approach (SPB) (Bentivogli et al. 2016), a hierarchical approach (HPB) (Jehl et al. 2015), and a combined syntax-based and phrase-based approach (PBSY) (Huck and Birch 2015). The test set consists of 600 sentences (roughly 10,000 words), post-edited by five professional human translators. Table 16 briefly summarizes the approaches and corpora/dataset sizes used in the evaluation. Note that the Data column gives the size of the parallel training data for English and German, respectively.
Table 17 illustrates the scores of those four MT systems in TER (Snover et al. 2006) and BLEU (Papineni et al. 2002). As stated earlier, TER, the acronym for Translation Edit Rate (Snover et al. 2006), traces the edits of the MT output compared to the post-edited data. Specifically, HTER denotes Human-targeted TER, which compares the MT output with its manually post-edited version. On the other hand, mTER denotes Multi-reference TER, which is based on the closest translation among all available post-edits for each sentence. Notably, the lower the TER value, the better. We observe that NMT achieves the best BLEU score and the lowest TERs, which shows that NMT outperforms the other approaches.
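As a rough illustration of how TER-style scores are computed, the sketch below implements a simplified word-level edit rate together with an mTER-style minimum over several post-edits. Note that full TER as defined by Snover et al. (2006) additionally allows block shifts, which are omitted here for brevity.

```python
def simple_ter(hypothesis, reference):
    """Simplified TER: word-level Levenshtein distance / reference length.
    (Full TER as in Snover et al. (2006) also counts block shifts.)"""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(hyp)][len(ref)] / len(ref)

def simple_mter(hypothesis, post_edits):
    """mTER-style score: edit rate against the closest of several post-edits."""
    return min(simple_ter(hypothesis, pe) for pe in post_edits)
```

As with TER itself, lower is better, and mTER can only be lower than or equal to the score against any single post-edit.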
Notably, the MT systems are evaluated with sentences of varying length and with the different talks/speeches that compose the dataset. Figure 36 depicts the results of the sentence-length test. Again, we can notice that NMT has the lowest mTER, roughly 4% to 6% lower than the SMT approaches. Interestingly, for the NMT system, when the sentence length exceeds 35 words, the translation quality degrades more rapidly than for PBSY, SPB, and HPB. The long-sentence issue is one of the issues for NMT to overcome. Note that attention-based NMT and transformer-based NMT help address this issue, as discussed earlier in this article.
Figure 37 plots the experimental results across talks in the TED Talks dataset. Note that these talks are diversified in topics and presentation styles from different speakers, which may be factors affecting translation quality. For this reason, three factors are computed for each talk: the length of the talk, the average sentence length, and the Type-Token Ratio (TTR), which indicates lexical diversity. The analysis shows a modest Pearson correlation (Benesty et al. 2009) between TTR and the mTER gains of NMT over the other three systems in each talk. This result suggests that NMT is better at dealing with lexical diversity. Last but not least, Bentivogli et al. (2016) also show that NMT makes fewer morphology and lexical errors than SMT.
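The two quantities used in this talk-level analysis can be sketched directly; the implementations below are minimal textbook versions, not the exact tooling of Bentivogli et al. (2016).

```python
import math

def type_token_ratio(text):
    """Lexical diversity: distinct words (types) over total words (tokens)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In the study above, `xs` would be the per-talk TTR values and `ys` the per-talk mTER gains of NMT over a competing system.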

5.3 Yamada et al., 2019 (Yamada 2019)

The authors in Yamada (2019) present a comparative study of post-editing (PE) potential. In particular, it investigates the performance of college students in a comparative analysis between SMT and Google NMT. The experimental results show that NMT with PE (NMT+PE) outperforms SMT with PE (SMT+PE).


Table 19  Length of sentences of three MT systems, both in SMT and NMT, Liu (2020)

System  Google Translate  Baidu Translate  Bing Translator
NMT     37                30               20
SMT     21                21               21

NMT outperforms SMT in handling longer sentences

5.4 Stasimioti et al., 2020 (Stasimioti et al. 2020)

In 2020, Stasimioti et al. (2020) present a comparative analysis among three MT systems: a generic SMT, a generic NMT, and a tailored-NMT system. Notably, they perform the analysis on English-Greek translation. The experimental results show that NMT (both generic and tailor-made) outperforms SMT, with the tailor-made NMT winning marginally, as shown in Table 18. As we know, the higher the BLEU and METEOR scores, the better. By contrast, as stated earlier, the lower the TER or WER, the better the system.

5.5 Liu et al., 2020 (Liu 2020)

Liu (2020) presents a comparative analysis of the usage of cohesive devices in Chinese-English translation by three commonly used MT systems, namely Google Translate, Baidu Translate, and Bing Translator (notably, Bing Translator is one of the web apps of Microsoft Translator), compared with a human translator. The experimental analysis shows that NMT outperforms SMT in handling cohesive ties such as pronouns, adverbs, additives, and others. Besides, it shows that NMT handles varying and longer sentence lengths more decently than SMT, as shown in Table 19. Table 20 shows that cohesive devices can be used toward better translation and also presents that NMT surpasses SMT in handling cohesive ties.

5.6 Islam et al., 2021 (Islam et al. 2021)

The authors in Islam et al. (2021) present a comparative study of Bengali-English translation among RBMT, NMT, and SMT. They also perform an extended analysis blending different MT approaches together for low-resource languages and show how the blending approach can be reused for other low-resource languages, such as Arabic, Bengali, and Nepali. Their experimental results show that NMT blended with RBMT (BLEU score 18.73) performs better than SMT blended with RBMT (BLEU score 18.02).
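For reference, BLEU scores such as those cited above follow the standard corpus-level formula of Papineni et al. (2002): clipped n-gram precisions combined with a brevity penalty. The sketch below is a minimal single-reference version, not the exact tooling used by Islam et al. (2021).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (Papineni et al. 2002), one reference per segment."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = Counter(ngrams(h, n)), Counter(ngrams(r, n))
            # Clipped (modified) n-gram precision counts
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # any zero n-gram precision drives BLEU to zero
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

Scores are in [0, 1] here; published results such as 18.73 are simply this value scaled by 100.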

5.7 Summary

The analysis in this section, together with the comparative analysis among the NMT models in Sect. 3.4, shows that NMT outperforms SMT. Specifically, Google NMT wins the race against Google SMT, and Google Translate is the most widely used translation app.

Table 20  Cohesive ties as references and conjunctions, reproduced from the information in Liu (2020)

Cohesive Ties                   Human       Google- Google- Baidu- Baidu- Bing- Bing-
                                Translator  SMT     NMT     SMT    NMT    SMT   NMT
Conjunction  Additive           44          37      47      42     45     41    38
             Adversative        3           4       3       3      3      3     3
             Causal             1           0       0       2      0      0     0
             Temporal           1           0       0       1      0      0     0
             Sub-total          49          41      50      48     48     44    41
Reference    Demonstratives     11          3       5       3      3      4     6
             Definite article   56          47      50      73     55     39    46
             Comparatives       4           8       7       6      7      8     9
             Adverbs            3           1       2       3      3      3     3
             Personal pronouns  11          12      11      7      10     10    9
             Sub-total          85          71      75      92     78     64    73
             Total              134         112     125     140    126    108   114

NMT surpasses SMT in handling cohesive ties

6 Corpora and datasets for SMT and NMT

We would like to note that in each section of the experimental analysis, we demonstrate
the corpora and datasets used by the specific methods or models for the evaluation of their
approach.
With syntax-based SMT (Yamada and Knight 2001) (Sect. 2.3), a Japanese-English dictionary was used, with a corpus of 2121 translation sentence pairs drawn from the dictionary. The authors also use this corpus to evaluate the word-based IBM models.
With PBMT (Koehn et al. 2003) (Sect. 2.4), the authors used the publicly available Europarl corpus20 (Koehn 2005) to perform the evaluation of their approach.
With NMT (Sect. 3), the commonly used datasets are from the Statistical and Neural Machine Translation → Events page.21 In particular, the translation task datasets associated with the conference/workshop on Empirical Methods in Natural Language Processing (EMNLP), especially the conference on machine translation (WMT),22 are used. On the other hand, Google NMT (GNMT) uses Google's internal production datasets alongside the datasets from the WMT workshop (ACL 2022a). With the T5 model, Google uses the pre-trained C4 dataset alongside the datasets from the WMT workshops and conferences.
In general, the commonly used corpora and datasets for MT models are from the WMT conferences. In particular, the datasets are primarily from the sources of Europarl, ParaCrawl, Common Crawl corpus, News Commentary, Wiki Titles, WikiMatrix, UN Parallel Corpus, CzEng, Tilde MODEL corpus, CCMT Corpus, Yandex Corpus, OPUS, Back-translated news, Japanese-English Subtitle Corpus, The Kyoto Free Translation Task Corpus, TED Talks, ELRC - EU acts in Ukrainian, Yakut parallel and monolingual data, ParIce, the 10^9 French-English corpus, and many others. Notably, we can get both parallel data and monolingual data from these sources, except for a few where no monolingual data is available. Also, we would like to note that some of the monolingual corpora are supersets of the aforementioned parallel corpora and are also available from other reliable sources accessible under the WMT conferences. We observe that the characteristics and parameters of most corpora and datasets are revised from time to time, especially with a new release. For this reason, we do not present their details in the paper. On the other hand, we add the web links of the listed corpora and datasets, so that readers can easily get the most up-to-date information.

6.1 Europarl Corpus

We observe that the Europarl corpus (Koehn 2005; Eur 2022) is a parallel corpus drawn from the proceedings of the European Parliament. It is available in European languages, including Romance (Spanish, French, Italian, Romanian, Portuguese), Germanic (German, English, Dutch, Swedish, Danish), Slavic (Czech, Bulgarian, Polish, Slovene, Slovak), Finno-Ugric (Hungarian, Finnish, Estonian), Baltic (Lithuanian, Latvian), Greek, and others. The datasets and corpora are available under the release versions v1, v2, v3, v5, v6, v7, v8, v9, and v10. Notably, in the newer releases some of the datasets and corpora are revised, added, or deleted; for this reason, we list all the released versions herein.

20 Europarl Datasets https://www.statmt.org/europarl/.
21 Statistical and Neural Machine Translation → Events https://www.statmt.org/.
22 WMT22 https://www.statmt.org/wmt22/, WMT21, WMT20, and so on.
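Europarl is distributed as line-aligned plain-text files, one sentence per line per language, so loading it reduces to zipping two files. The sketch below is a minimal loader; the file names in the usage comment (the v7 German-English pair) are illustrative.

```python
def load_parallel_corpus(src_path, tgt_path, max_pairs=None):
    """Read a line-aligned parallel corpus (e.g., Europarl) into sentence pairs."""
    pairs = []
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for src, tgt in zip(fs, ft):
            src, tgt = src.strip(), tgt.strip()
            if src and tgt:  # drop pairs where either side is empty
                pairs.append((src, tgt))
            if max_pairs and len(pairs) >= max_pairs:
                break
    return pairs

# Example usage (file names illustrative):
# pairs = load_parallel_corpus("europarl-v7.de-en.de", "europarl-v7.de-en.en")
```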

6.2 ParaCrawl Corpus

The ParaCrawl corpus (Esplà-Gomis et al. 2019; Par 2022) is an ongoing web-wide provision of parallel corpora for European languages. Its current release, ParaCrawl v9, is available in 42 languages including European languages and Chinese, especially paired between English and the corresponding languages. Notably, all release versions are accessible from the ParaCrawl corpora page.

6.3 Common Crawl Corpus

The Common Crawl corpus (Buck et al. 2014; Luccioni and Viviano 2021; Common Crawl corpus 2022) includes a large volume (petabytes) of web-crawled data collected since 2011. It includes text extracts, metadata extracts, and raw web page data. The datasets can be accessed from the Common Crawl corpus page.

6.4 WikiMatrix Corpus

The WikiMatrix corpus is an extraction of 135 M parallel sentences from Wikipedia in 1620 language pairs (Schwenk et al. 2019; Wikimatrix 2022). It is an ongoing project toward mining parallel sentences for all possible language pairs. Notably, we can get the details of the corpus, supported pairs, and others from the WikiMatrix Info page, and we can access a subset of the datasets from the WikiMatrix v1 page.

6.5 Wiki Titles Corpus

The Wiki Titles corpus is a parallel corpus consisting of bilingual titles of Wikipedia articles, expanded with title redirections and text links. We observe that, with 253 parallel corpus files, it includes 63,573,278 bilingual article titles. We can get the details of the corpus, supported pairs, and the sources of the supported datasets from the Wiki Titles corpus page.

6.6 United Nations (UN) Parallel Corpus

The United Nations (UN) Parallel Corpus comprises publicly accessible UN official records and parliamentary documents (Ziemski et al. 2016; UNP 2022), which are primarily available in all six official UN languages. We can get the details of the corpus, supported pairs, and the sources of the datasets from the UN Parallel Corpus page.

6.7 Czech‑English (CzEng) Parallel Corpus

The CzEng parallel corpus is a sentence-parallel Czech-English corpus. Currently, it has six different releases. We can get the details of the corpus and the datasets from the CzEng parallel corpus page.


6.8 Tilde Multilingual Open Data for European Languages (MODEL) Corpus

The Tilde MODEL corpus is the commitment of Tilde toward creating new multilingual corpora for European Union (EU) languages. It primarily focuses on the smaller (low-resource) languages that need them the most. We can get the details of the corpus, supported pairs, and the datasets from the Tilde MODEL corpus page.

6.9 Chinese Conference Machine Translation (CCMT) Corpus

The CCMT corpus consists of news articles with monolingual Chinese, parallel Chinese-English, and multiple references. We can get the details of the corpus and access the dataset from the CCMT corpus page.

6.10 Yandex (English‑Russian parallel) corpus

The Yandex corpus includes Russian-English parallel sentences collected from the web. We can get the details of the corpus and access the dataset from the Yandex corpus page.

6.11 Open Source Parallel Corpus (OPUS) Corpus

OPUS is an ever-growing corpus of multilingual translated texts that is available as open source. We can get the details of the corpus, supported pairs, and the sources of the datasets from the OPUS corpus page.

6.12 Japanese‑English Subtitle Corpus (JESC)

As the title suggests, JESC is a Japanese-English subtitle corpus built by crawling the subtitles of web movies and television programs (Pryzant et al. 2018). The speciality of this dataset is that it covers poorly represented colloquial phrases. We would like to note that it is a collaborative effort of Google Brain, Stanford University, and Rakuten Institute of Technology. We can get the details of the corpus and download the dataset from the JESC corpus page.

6.13 The Kyoto Free Translation Task (KFTT) Corpus

The KFTT is a Japanese-English translation corpus focusing on Wikipedia articles related to Kyoto. We can get the details of the corpus and download the dataset from the KFTT corpus page.

6.14 ParIce Corpus

The ParIce corpus is a parallel English-Icelandic corpus consisting of diverse subcorpora collected from the web and built from scratch. We can get the details of the corpus and download the dataset from the ParIce corpus page.


Table 21  Language symbols

Symbol  Language    Symbol  Language    Symbol  Language
AR      Arabic      BN      Bengali     CS      Czech
DE      German      EN      English     ES      Spanish
ET      Estonian    FI      Finnish     FR      French
HA      Hausa       HI      Hindi       HR      Croatian
IS      Icelandic   IT      Italian     JA      Japanese
LIV     Livonian    LV      Latvian     LT      Lithuanian
PL      Polish      PT      Portuguese  RO      Romanian
RU      Russian     SAH     Yakut       SV      Swedish
UK      Ukrainian   XH      Xhosa       ZH      Chinese
ZU      Zulu

Notably, another two popular datasets are JRC-Acquis (Steinberger et al. 2006) and Arab-Acquis (Habash et al. 2017). The JRC-Acquis is a parallel corpus covering 22 European Union (EU) languages. The Arab-Acquis is also a parallel corpus, built on top of JRC-Acquis, consisting of data between Arabic and the 22 European languages.
In sum, we briefly present the corpora and the supported translation language pairs in Table 22. We observe that the supported pairs are revised with each new release, so the table is subject to change over time. For an easy depiction of the table, we use the language symbols shown in Table 21. Notably, we do not present all the languages herein; we simply go with a subset of languages, as shown below:

7 Gender bias and ethical pitfalls in machine translation

We observe that MT encounters many gender bias and ethical issues. For example, consider an MT system performing translation from Hungarian or Bengali into English. Notably, Hungarian and Bengali (Bengali, or Bangla, is the native language of one of the authors of this paper) both have gender-neutral pronouns. On the other hand, as we know, English has grammatical gender. Therefore, when we translate a reference to a person without specifying any gender, MT systems often default to the male gender, which may cause harm to the users and community at large (Jurafsky and Martin 2022; Prates et al. 2020; Rescigno et al. 2020; Savoldi et al. 2021; Schiebinger 2014; Stanovsky et al. 2019). Besides, we find that MT systems mostly assign genders in accordance with cultural stereotypes. To validate this, we translate a set of gender-neutral sentences about common professions. Notably, the translation is carried out by two of the commonly and popularly used mainstream machine translation systems. Due to privacy issues, the names of the translation systems are not listed in the paper. Consequently, we show the example translations of Hungarian into English in Figs. 38 and 39, respectively, and the example translations of Bengali into English in Figs. 40 and 41, respectively.
Moreover, we observe the same example translations of Hungarian into English from Prates et al. (2020) and find similar translations, as stated earlier in Figs. 38 and 39. The gender-neutral Hungarian Ö is a nurse is translated with she, but the gender-neutral Ö is a wedding organizer or wedding planner is translated with he.
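The probing procedure described above can be sketched as follows. The `translate` stub merely mimics the stereotyped behaviour we observed, purely for illustration; a real study would call an actual MT API, and the Hungarian template is a simplified assumption.

```python
# Sketch of probing an MT system with gender-neutral profession templates.
# STEREOTYPED_STUB and `translate` are placeholders that only mimic the
# stereotyped behaviour described in the text, for illustration.

STEREOTYPED_STUB = {
    "nurse": "she is a nurse",
    "wedding planner": "he is a wedding planner",
    "engineer": "he is an engineer",
}

def translate(gender_neutral_sentence, profession):
    """Stand-in for a real MT API call."""
    return STEREOTYPED_STUB[profession]

def pronoun_counts(professions):
    """Count which English pronoun the system chooses per profession."""
    counts = {"he": 0, "she": 0, "they": 0}
    for p in professions:
        # Hypothetical gender-neutral Hungarian template "Ö egy ..."
        first_word = translate("Ö egy " + p, p).split()[0]
        if first_word in counts:
            counts[first_word] += 1
    return counts
```

Aggregating such counts over many professions is essentially how stereotype-driven gender assignment is quantified in studies like Prates et al. (2020).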
Besides, Stanovsky et al. (2019) show that the translations by MT systems are not good when asked to translate sentences describing people with non-stereotypical gender roles.

Table 22  The supported language translation pairs with different corpora and datasets. Notably, it is subject to change with the new release or time

Language pairs covered: AR–EN, BN–HI, BN–EN, CS–EN, CS–PL, DE–EN, ES–EN, ES–PT, ET–EN, FI–EN, FR–DE, FR–EN, HA–EN, HR–EN, IS–EN, IT–EN, JA–EN, LIV–EN, LT–EN, LV–EN, PL–EN, RO–EN, RU–EN, SAH–RU, SV–EN, UK–EN, UK–CS, XH–ZU, ZH–EN
Corpora listed: Europarl, ParaCrawl, Common Crawl Corpus, News Commentary, Wiki Titles, WikiMatrix, UN Parallel Corpus, CzEng, Tilde MODEL corpus, CCMT Corpus, Yandex Corpus, OPUS Corpus, Back-translated news, JESC, KFTT Corpus, TED Talks, ELRC-Ukrainian, Yakut data, ParIce, 10^9 French-English corpus

Fig. 38  Hungarian to English translation by a commonly used machine translation system. We see that the gender-neutral Ö in Hungarian is a nurse is translated with she, but the gender-neutral Ö is a wedding planner is translated with he. [First trial translation by one of the two anonymous translation systems mentioned earlier]

Fig. 39  Hungarian to English translation by a commonly used machine translation system. We see that the gender-neutral Ö in Hungarian is a nurse is translated with she, and the gender-neutral Ö is a wedding planner is also translated with she. Note that in the first trial translation, in Fig. 38, the gender-neutral Ö is a wedding organizer is translated with he. On the other hand, the gender-neutral Ö is a teacher is translated with she. Note that in the first trial translation, in Fig. 38, the gender-neutral Ö is a teacher is translated with he. [Second trial translation by the other anonymous translation system mentioned earlier]

For example, consider "The doctor asked the nurse to help her in the operation". We get the translation of this sentence in Spanish as "El doctor le pidió a la enfermera que le ayudara con el procedimiento". We observe that in the source English sentence "the doctor" refers to a female person, but in the target Spanish sentence it is translated to the male form "El doctor", which is wrong. On the other hand, the gender of the nurse is unspecified; however, in the translation, it is rendered as female ("la enfermera"). Notably, gender bias degrades the quality of service, as we may require additional time, effort, and energy to verify incorrect gender references (Savoldi et al. 2021).
Fig. 40  Bengali to English translation by a commonly used machine translation system. We would like to share that the first word is the gender-neutral pronoun in Bengali, and we see the translations are alike as in Fig. 38, i.e., a nurse is translated with she. On the other hand, a wedding organizer is translated with he. [First trial translation by one of the two anonymous translation systems mentioned earlier]

Fig. 41  Bengali to English translation by a commonly used machine translation system. We see the translations are alike as in Fig. 39, i.e., a nurse is translated with she. Besides, a wedding organizer is translated with she. On the other hand, a teacher is translated with he. Note that in the first trial translation, in Fig. 40, a wedding organizer is translated with he. [Second trial translation by the other anonymous translation system mentioned earlier]

To this end, we review a number of research works dealing with mitigating gender bias in MT and overhaul the underlying principles, effectiveness, and limitations of prominent gender-bias mitigation methods. Toward mitigating gender bias, Savoldi et al. (2021) introduce a set of prominent approaches, categorizing them into two groups: model debiasing and debiasing through external components. In model debiasing, the focus is on changing the architecture of MT models and/or training procedures with word-level and sentence-level gender tagging, adding context, debiased word embeddings, and balanced fine-tuning. On the other hand, in the second category (debiasing through external components), the focus is on introducing additional components, such as black-box injection, lattice re-scoring, and gender re-inflection, alongside the MT model instead of retraining it. In particular, black-box injection is for controlling the generation of feminine references and for improving the ability of the MT model to generate feminine forms. The other two are self-explanatory, so we do not demonstrate them herein. Similarly, Sun et al. (2019) present a literature review on mitigating gender bias in NLP. In particular, the authors show the effectiveness, limitations, and future directions of "gender debiasing methods". The effectiveness of the methods is achieved/validated with approaches as in Savoldi et al. (2021). On the other hand, some of the limitations in their analysis are summarized as follows: (1) the majority of debiasing methods focus on a single, modular process, and it remains undiscovered how the individual components align together to build an elegantly unbiased system; (2) empirical evaluation and verification of most debiasing methods are only performed for limited applications; (3) some debiasing approaches introduce noise to the model, thereby degrading the quality of service.
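One of the model-debiasing techniques mentioned above, sentence-level gender tagging, can be sketched as follows; the tag tokens and function names are illustrative assumptions, not taken from any specific system.

```python
# Sketch of sentence-level gender tagging for MT training data, one of the
# "model debiasing" techniques surveyed by Savoldi et al. (2021).
# Tag tokens are illustrative, not from any specific system.

FEMALE_TAG, MALE_TAG, NEUTRAL_TAG = "<F>", "<M>", "<N>"

def tag_source(source_sentence, referent_gender):
    """Prepend a gender tag so the model can condition on the intended gender."""
    tag = {"female": FEMALE_TAG, "male": MALE_TAG}.get(referent_gender, NEUTRAL_TAG)
    return f"{tag} {source_sentence}"

def tag_corpus(pairs_with_gender):
    """pairs_with_gender: iterable of (source, target, gender) triples."""
    return [(tag_source(src, g), tgt) for src, tgt, g in pairs_with_gender]
```

At inference time the tag is supplied by the user or a classifier, steering an otherwise ambiguous source toward the requested gendered form instead of a stereotyped default.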
We observe that research works in the gender-bias mitigation domain mostly focus on mitigating the bias in English; it is also important to reduce gender bias in languages other than English. Meanwhile, it is further important to address non-binary gender bias (Richards et al. 2016; Saunders et al. 2020). In this regard, we find that interdisciplinary collaboration can help reduce gender bias significantly (Ullmann 2022), for example, by introducing linguistic knowledge into MT as discussed in Sect. 5.1.1. Cross-disciplinary research can help us in a better way. Another viable means for bias reduction is domain adaptation, introduced by Tomalin et al. (2021). The authors show that domain adaptation with transfer learning on gender-balanced datasets can effectively be used to debias MT systems and can result in less biased translations. We also describe one of the recent works by Google Research,23 by Sun et al. (2021). The authors, in their study, show ways of rewriting gender-neutral English, training/building an NMT model for generating gender-neutral English, and guiding toward minimizing the ethical issues in machine translation. Their experimental analysis shows the model can generate gender-neutral English very optimally (BLEU score 99, the higher the better, and WER score 0.99, the lower the better). Another research work from Google, by Johnson (2020), shows ways of reducing the gender bias of gender-specific translations for a set of languages by giving the translations for both the male-gendered and female-gendered output and letting the user choose the specific one to use (Johnson 2020).
Besides, we observe that ethical issues in MT systems require further careful study, for
example, in courts, where MT assists judges or legal advisors in communication with
witnesses or accused persons (Jurafsky and Martin 2022; Vieira et al. 2021).
To this end, systems need viable means of assigning confidence values to candidate trans-
lations, so that they can refrain from providing incorrect translations that can be harm-
ful (Jurafsky and Martin 2022; Wahler 2018). Similarly, in hospitals, we know that
wrong translations of medical documents, conversations between clinicians and patients,
or health education materials can cause severe harm (Hill et al. 2022; Mehandru et al. 2022;
Vieira et al. 2021). Notably, to address the issue, Mehandru et al. (2022) carried out an
interview-based qualitative study of medical specialists and proposed design guidelines for
developing an appropriate MT system for medical specialties. The authors suggest combining
MT with pre-translated medical words/phrases, incorporating translation support with multi-modal
communication, and granting interactive support to test mutual understanding. However,
this implication requires extensive manpower, computing resources, skills, and time, among
others, so we have to find better alternatives. Moreover, mistranslation of critical (and even
non-critical) texts introduces risks (Bender 2019; Moorkens 2022; Sakamoto
2019). For example, the auto-translation of an Arabic post meaning "good morning" as "hurt
them" led to an arrest (Bender 2019). Therefore, wrong translations can introduce risks or threats. To
this, Canfora and Ottmann (2020) and Moorkens (2022) show that current MT standards
do not assume liability or mention any risks for poor translations.

23 Google Research: Machine Translation, https://research.google/research-areas/machine-translation/.

On the other hand, the
confidentiality of sensitive data is important, though such data is sometimes made public
by some open-source machine translators. Therefore, it is important to decide, and to
evaluate, whether or not MT should be used in the public domain or elsewhere. In particular,
it is important to quantify the domain-wise performance (i.e., accuracy and quality) of an
MT system before using it. We also observe that some language models are biased against
certain communities, which introduces social risks and threats (Abid et al. 2021; Weidinger
et al. 2021). Notably, such bias can be reduced by introducing words or phrases into the
context that provide strong positive associations (Abid et al. 2021; Weidinger et al.
2021). Due to privacy issues, we do not go into further detail here. Another ethical
issue is sustainable AI (and, by extension, sustainable NMT) development and deployment,
as it has an adverse impact on the environmental ecosystem due to
vast power consumption and carbon emission (Kenny et al. 2020; Moorkens 2022; Strubell
et al. 2019; van Wynsberghe 2021). To this, van Wynsberghe (2021) suggests
developing and deploying sustainable AI; otherwise, AI itself will not be sustainable.
The author directs the AI ecosystem to pay careful attention to the
lifecycle of AI products so as to sustain the environmental ecosystem for all generations,
current and future. In particular, the author discusses sustainable AI, which
comprises Sustainability of AI and AI for Sustainability. Sustainability of AI
primarily deals with finding reusable data and reducing carbon
emission and computational power, measuring them over the course of training, toward
the sustainable development of AI applications and sustainable usage of AI models. On
the other hand, AI for Sustainability explores AI applications for
achieving sustainability. Apart from this, it is
also important to follow AI ethical considerations and guidelines, such as reducing data
bias, optimizing data quality, retaining confidentiality and ownership of data, ensuring
transparency of AI systems' processing and decision making, and preventing harm or risk
due to misuse or misinterpretation by AI, among others (Horváth 2022).
In sum, in this section we reviewed gender bias and the ethical pitfalls of machine transla-
tion, and presented possible directions for addressing them. We observe that further
research and careful study are needed to address these issues entirely.

8 Evaluation methods for machine translation

For every MT system, the output has to meet quality expectations. To judge whether a
system or algorithm is of high quality, people must establish reliable evaluation
methods for MT, manual or automatic. Devising a mature evaluation method is one of the
most difficult aspects of the MT field, as it must capture the close correlation between any
source and target languages. Natural language has no explicit representation like the
formulas or equations of mathematics, especially as ambiguity exists in most
natural languages. The significance of MT evaluation includes (Jian Zhang 2003):

– For the developers of MT systems, the evaluation method reveals the pitfalls of the sys-
tem so that improvements can be made accordingly.
– For the users of MT products, they can freely select a suitable translation system accord-
ing to their demands and the evaluation results.


Table 23  The Intelligibility Measurement (Council 1416)

Category Description

9 Absolutely perfect and intelligible. Reads like normal text without stylistic errors
8 Almost absolutely perfect and intelligible, with a small number of grammatical or stylistic errors.
  Sometimes has uncommon but easily "corrected" usage
7 Almost acceptable and intelligible, but the sentence style or choice of words is a little worse than
  category 8
6 Almost immediately intelligible. The understood result has poor grammatical and stylistic
  errors, some untranslated words, and incorrect expressions. Editing and revision could give a
  sufficient sentence
5 Intelligible only after sufficient study and reflection. After study, the understood result has
  poor grammatical and stylistic errors, including some "noise", while the main idea is discov-
  ered
4 Appears to be intelligible but is actually unintelligible. The idea is only partly discovered. There are
  very poor grammatical and stylistic errors, with much "noise" and many untranslated words
3 Generally unintelligible. Although it mostly seems to be nonsense, a reasonable meaning can
  be found with effective study and reflection
2 Almost absolutely unintelligible even after study and reflection. The sentence may have some mean-
  ing but is useless
1 Absolutely unintelligible. No study and reflection could get the point of the sentence

Table 24  The Fidelity Measurement (Council 1416)

Category Description

9 Extremely informative. "Completely different" in understanding what is meant. (Rating 9 should
  be given if the translation totally changes the meaning of the original.)
8 Very informative. Helps clarify what is being said. By correcting sentence structures, words, and
  phrases, it dramatically changes the reader's impression of the meaning, although it does not
  completely modify or alter it
7 (Between 6 and 8.)
6 Distinctly informative. Includes a lot of information about the structure of the sentence and single
  words to keep the reader on track about their meaning
5 (Between 4 and 6.)
4 Adds a certain amount of sentence-structure and syntactic-relationship information; it can also
  correct small errors relating to the familiar meaning of a sentence or the meaning of single words
3 Corrects few possible key meanings. Mainly at the lexical level, it introduces slightly different
  "distortions" to the meaning conveyed by the translation. However, it does not add new infor-
  mation about sentence structure
2 The original does not add any real new meaning, either lexically or grammatically, but the reader
  will feel that he or she understands the meaning of the original
1 Does not carry any information. No new meaning is added. The reader's confidence in their own
  understanding is neither decreased nor increased
0 The input text carries less information than the translation. The translator adds some meaningful
  words, obviously in order to make the passage easier to understand

– For the researchers of MT, the evaluation results can provide a reliable foundation for
their technological developments.


To support research activities in MT evaluation, many evaluation events and work-
shops have been established internationally. For example, the Special Interest Group on
Machine Translation (SIGMT) (Koehn and Chiang 2019) organizes workshops under the roof of the
conferences of the Association for Computational Linguistics (ACL). The famous
Workshop on (Statistical) Machine Translation (WMT), which mainly focuses on the trans-
lation task and the evaluation task (Wang 2005), has been held by SIGMT yearly since 2006. Besides,
SIGMT has also organized the Workshop on Syntax and Structure in Statistical Translation
(SSST) annually since 2007. Furthermore, the International Workshop on Spoken Lan-
guage Translation (IWSLT) (IWSLT 2018) hosts an open evaluation campaign on spoken
language translation, with scientific papers presented annually. These workshops and cam-
paigns massively boost the development of MT. In this section, the common approaches
to MT evaluation will be introduced, including human evaluation methods and automatic
evaluation methods.

8.1 Human evaluation methods

In this section, two traditional human evaluation methods will be introduced.

8.1.1 ALPAC’s study

In 1964, the US government funded the formation of the Automatic Language Processing Advi-
sory Committee (ALPAC) to investigate progress in computational
linguistics, including MT (Wiki 2019a). ALPAC set up the very first evaluation method
for MT, whose measurements mainly focus on intelligibility and fidelity,
using human "raters" as judges. Based on its study, ALPAC published a report
(Pierce 1966) as advisory guidance for the future development of the industry.
Intelligibility measures whether the translation is understandable
to humans. By contrast, fidelity measures how much information the translated
sentence retains from the source sentence. Tables 23 and 24 illustrate
the intelligibility and fidelity measurements, respectively.
Interestingly, ALPAC showed a negative attitude towards MT. The experimental results
depicted that the quality of MT was low compared to human translation and that the cost of
modifying MT output was greater than that of translating by a human; therefore, the US
government decided to halt MT development, which led to a drastic decline in
MT research (Jian Zhang 2003). It was the immature hardware and technolo-
gies of that period that caused the low performance of MT. It was not until the 1980s that MT
research "revived", with the boost in computer hardware performance and
the application of artificial intelligence in NLP.

8.1.2 DARPA’s study

In 1991, the Defense Advanced Research Projects Agency (DARPA) instigated a pro-
gram (White 1995) to focus on the quality of some mainstream MT approaches at that
time, including French-English, Japanese-English, and Spanish-English translation.
DARPA's evaluation program provided a general standard for assessing MT
in three measuring aspects based on human judgments: Adequacy, Fluency,
and Informativeness (White 1995), administered over 200 sets of evalua-
tion materials.
The Adequacy measurement determines how much information is conveyed from the
source text to the target translation, regardless of the translation quality. In the
evaluation process, human judges are asked to compare fragments (fragments
are syntactic constituents of the reference translations) with the translations and score them
on a scale from 1 to 5 according to how well the information is conveyed in the translations.
The higher the score, the better the translation quality. The result for the
whole translation set is then the average of the fragment scores.
In contrast to the Adequacy measurement, Fluency measures how good the
target-language text (e.g., English) is, determining whether the sentence is fluent
and well-formed, regardless of the correctness of the information. The same human judges
rate sentences on a scale from 1 to 5 on a sentence-by-sentence basis, and the result of
Fluency is computed in the same way as the Adequacy measurement.
Last but not least, the Informativeness measurement measures how much
information the translation conveys, so that people can obtain enough informa-
tion about the system's translation ability. The evaluation is carried out via a multiple-choice
question test. For instance, in one of the evaluation processes, six questions were devel-
oped from the expert reference translations and each question had six possible answers
for human judges to evaluate the informativeness of a translation system, including
"none of the above" and "cannot be determined".
We can observe that DARPA's evaluation program, along with its evaluation methodol-
ogy, is significant in the MT industry, and these measures are still being used today.
The outputs of human evaluation methodologies are, conventionally, accurate.
However, the high cost of conducting human evaluations is hard to overlook.
Firstly, these evaluations are time-consuming and usually take days to months, which
drastically reduces the efficiency of the evaluation. Secondly, human evaluation lacks
objectivity: the output varies across testers or raters. Even the best translators in the
world might provide different evaluation results for the same texts, making the outputs
unrepeatable (Yun Huang 2008).

8.2 Automatic evaluation methods

The pitfalls of human evaluation restrict the development of MT. Researchers and
developers have turned to automatic evaluation to seek a better and more convenient
approach to evaluating MT systems. In 1999, DARPA funded the TIDES program, which stands for
Translingual Information Detection, Extraction, and Summarization (Agency 2003).
The TIDES program was initiated to improve the automatic processing of information detection,
extraction, summarization, and translation based on massive human language data, boost-
ing the interpreting process so that people can overcome the language barrier. As a part
of the TIDES program, since 2001, the evaluation activity Open Machine Translation Evalu-
ation (OpenMT) (National Institute of Standards and Technology 2010), held by the National Institute of Stand-
ards and Technology (NIST), focuses not only on human evaluation methods but also on
the automatic evaluation methods, which provides significant guidance on research efforts
and technical calibration for MT. For the automatic evaluation involved
in OpenMT, an upgraded version of the Bilingual Evaluation Understudy (BLEU) (Papineni
et al. 2002) metric is adopted. Besides, we find that George Doddington et al. proposed the
NIST metric (Doddington 2002) based on BLEU; the name comes from the National Insti-
tute of Standards and Technology in the US. Apart from the BLEU and NIST metrics, some
other popular automatic evaluation methods have been proposed, such as METEOR (Metric for
Evaluation of Translation with Explicit ORdering) (Banerjee and Lavie 2005) and ROUGE
(Recall-Oriented Understudy for Gisting Evaluation) (Lin 2004). These reliable auto-
matic evaluation methodologies bring significant convenience to MT researchers and
developers. In this section, we briefly present these popular and commonly used evalu-
ation metrics.

8.2.1 The BLEU metric

BLEU, proposed by Papineni et al. (2002) from the IBM T. J. Watson Research Center,
was intended to break the bottleneck of human evaluation methods, which have low
efficiency and high cost. The aim was to create a methodology that is
friendly to developers and correlates tightly with human evaluation. In the evaluation pro-
cess, BLEU measures similarity, intuitively over short sequences of words, between the
MT output and professional human translation references (Papineni et al. 2002). These
sequences of words are regarded as N-grams of words (Popović 2017). The metric
requires two inputs: a numerical translation closeness metric and a human translation ref-
erence corpus.
Consider the following translation outputs derived from two separate MT systems, whose
source sentence is in Chinese (Ma et al. 2018; Barzilay and Koehn 2004):

– Candidate 1: The new high-tech products in Guangdong exported 3.76 billion in the
first two month this year.
– Candidate 2: This year, the former two of Guangdong, the export of high-tech products
of 3.76 yi dollars.

We can clearly see that the translation quality of candidate 1 is better than that of candidate 2;
for us, as humans, it is very easy to distinguish the quality level. But how does
a machine determine translation quality? We compare these candidate translations with
three reference translations made by human translators (Barzilay and Koehn 2004; Ma et al.
2018):

– Reference 1: Guangdong's export of new high technology products amounts to US
  $3.76 billion in first two months of this year.
– Reference 2: Guangdong exports US $3.76 billion worth of high technology products in
the first two months of this year.
– Reference 3: In the first two months of this year, the export volume of new high-tech
products in Guangdong province reach 3.76 billion US dollars.

In the evaluation process, each candidate translation and each reference is divided into
sequences of words, for example, 'this year', 'high-tech products', 'in the first two months',
etc. If a candidate translation matches a considerable number of the sequences in the refer-
ence translations, then the machine considers it a good translation. BLEU compares
and collects the N-grams that appear simultaneously in the candidate transla-
tion and the translation reference(s). After obtaining the number of co-occurring N-gram(s),
it divides them by the total number of words in the candidate translation to obtain the precision,
which notably represents the accuracy of the given translation. In the above example, all words
of candidate 1 except "exported" appear in the reference translations. Assuming N = 1 (unigram),
the BLEU precision for candidate 1 is 16/17. By contrast, the BLEU precision for candidate 2 is
10/18. Likewise, assuming N = 2 (bigram), the BLEU precisions for candidate 1 and candidate
2 are 10/16 and 3/17, respectively. This precision is close to human evaluation, but it is
easy to fool with some bad translations. For example (Papineni et al. 2002):

– Candidate 1: the the the the the the the.
– Reference 1: The cat is on the mat.
– Reference 2: There is a cat on the mat.

For candidate 1, the unigram precision is 7/7, but it is an unreasonable translation. BLEU
uses the Modified Unigram Precision to alleviate this problem. Firstly, it computes the
clipped count of each word, as follows:
Count_clip = min(Count, Max_Ref_Count),

where Count denotes the occurrences of an N-gram in the candidate translation,
and Max_Ref_Count denotes the maximum occurrences of the N-gram in a single refer-
ence translation. If Count is bigger than Max_Ref_Count, BLEU clips the extra occurrence
number(s), which means that BLEU no longer regards the extra occurrences as legitimate
matches. Next, it divides the result by the total number of candidate words to derive the
Modified Unigram Precision. In this example, the Modified Unigram Precision is 2/7.
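The clipping step above can be sketched in a few lines of Python. This is an illustrative implementation of the idea, not the official BLEU tooling, and the function name is our own:

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clip each candidate word count by its maximum count in any single reference."""
    cand_counts = Counter(candidate.lower().split())
    clipped = 0
    for word, count in cand_counts.items():
        # Max_Ref_Count: the most times this word occurs in any one reference
        max_ref = max(ref.lower().split().count(word) for ref in references)
        clipped += min(count, max_ref)  # Count_clip
    return clipped / sum(cand_counts.values())

references = ["The cat is on the mat", "There is a cat on the mat"]
print(modified_unigram_precision("the the the the the the the", references))  # 2/7 ≈ 0.2857
```

Here "the" occurs at most twice in a single reference, so the seven candidate occurrences are clipped to two, reproducing the 2/7 result above.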
Similarly, the Modified N-gram Precision, p_n, is computed by:

p_n = \frac{\sum_{C \in \text{Candidates}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \text{Candidates}} \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')}.

However, if a candidate translation sentence is shorter than the reference translation sen-
tence, the precision of the candidate sentence would be biased toward a higher score. Therefore,
in order to obtain a high score, one might deliberately use an extremely short but false candi-
date sentence to "cheat" the evaluation system. BLEU introduces the sentence
brevity penalty (P_b) to alleviate this problem. Eventually, the BLEU score is computed by:

BLEU = P_b \cdot \exp\left(\sum_{n=1}^{N} W_n \log p_n\right),

where P_b is the brevity penalty. It is 1 when the length c of the candidate translation is
greater than the length r of the effective reference translation; otherwise, P_b = e^{1-r/c}, if c
is less than or equal to r. Notably, N = 4 in the baseline, with uniform weights W_n = 1/N.
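Putting the pieces together, a sentence-level BLEU can be sketched in Python as follows. This is an illustrative implementation under the baseline settings (N = 4, uniform weights W_n = 1/N), taking the closest reference length as the effective length r; it is not the reference BLEU implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, N=4):
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_sum = 0.0
    for n in range(1, N + 1):
        cand_counts = Counter(ngrams(cand, n))
        clipped = sum(min(c, max(Counter(ngrams(ref, n))[g] for ref in refs))
                      for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # a zero n-gram component makes sentence-level BLEU zero
        log_sum += (1.0 / N) * math.log(clipped / total)  # uniform weights W_n = 1/N
    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]  # effective reference length
    brevity = 1.0 if c > r else math.exp(1 - r / c)            # brevity penalty P_b
    return brevity * math.exp(log_sum)

print(bleu("the cat is on the mat", ["The cat is on the mat"]))  # 1.0
```

A perfect match scores 1.0, while the degenerate "the the the ..." candidate scores 0.0 because it has no matching bigram.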
The BLEU metric merely compares the word similarity of the MT system's translation
with multiple human reference translations, without understanding the meaning of the
words or sentences. Notably, it considers the adequacy and fidelity of human trans-
lation: in the N-gram matching process of BLEU, the unigram matches account for ade-
quacy, whereas the matches for N > 1 represent fidelity. BLEU shows great reliability
compared with human translators. Figure 42 shows the evaluation results of BLEU, a
monolingual translator, and a bilingual translator. S1, S2, and S3 represent three versions of
translation given by three different MT systems, and H1 and H2 represent two versions of

Fig. 42  Comparison between BLEU and human judges, reproduced from the information in Papineni et al. (2002)

Fig. 43  Comparison of the correlation of BLEU and NIST scores for four corpora (Chinese, French, Japanese, and Spanish), reproduced from Doddington (2002)

translation given by two different human translators. We can clearly notice the high cor-
relation between BLEU and the human judges with respect to the same translations.

8.2.2 The NIST metric

George Doddington proposed the NIST metric (Doddington 2002) in 2002; it is
based on BLEU. Consequently, NIST involves the N-gram concept, but it does not simply sum
up the matched N-gram weights to derive the precision as BLEU does. NIST introduces the

Fig. 44  Comparison of the brevity penalty factor of BLEU and NIST, reproduced from the information in Doddington (2002)

Table 25  Comparative analysis of METEOR with BLEU and NIST, reproduced from the information in Banerjee and Lavie (2005)

System ID    BLEU   NIST   Precision  Recall  F1     Fmean  METEOR
Correlation  0.817  0.892  0.752      0.941   0.948  0.952  0.964

Informativeness of each N-gram: the less frequently an N-gram occurs in
the reference data, the more weight it is assigned. Heavier weights are thus given
to the N-grams that contain more information, that is, the rarer ones.
The NIST score is finally computed using the arithmetic mean
of the N-gram matches between the candidate translations and the reference translations
(Álvaro Rocha et al. 2018).
In Fig. 43, we show a comparative analysis of the correlation scores of BLEU and NIST
for four corpora: Chinese, French, Japanese, and Spanish. We observe that they perform
more or less equally. Besides, we find that in the brevity penalty analysis,
BLEU and NIST differ by a small fraction, as shown in Fig. 44. Notably, NIST primarily
correlates well with human judgments.
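The informativeness idea can be illustrated for unigrams with a small Python sketch, where the weight of a word is derived from reference-corpus counts and rarer words receive larger weights. The actual NIST metric generalizes this to n-grams using (n-1)-gram history counts; this is our simplified illustration, not the official scorer:

```python
import math
from collections import Counter

def unigram_info_weights(reference_tokens):
    """Info(w) = log2(total_words / count(w)): rarer words carry more information."""
    counts = Counter(reference_tokens)
    total = len(reference_tokens)
    return {w: math.log2(total / c) for w, c in counts.items()}

weights = unigram_info_weights("the cat sat on the mat".split())
# "the" occurs twice among six words, so it gets log2(3);
# every other word occurs once and gets the larger weight log2(6).
```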

8.2.3 The METEOR metric

METEOR, proposed by Banerjee and Lavie (2005), is one of the popular metrics for
MT evaluation, like BLEU. It has some added features not present in the other metrics
discussed, e.g., stemming and synonymy matching. Note that the
authors introduced the METEOR metric to address some of the issues that exist in the
NIST and BLEU metrics (Banerjee and Lavie 2005; Lavie et al. 2004). For instance,
Table 26  Comparison of the adequacy score, Lin and Och (2004). In some cases, the ROUGE methods outperform the prominent BLEU and NIST methods
Adequacy With Case Information (Case) Lower Case (NoCase) Lower Case & Stemmed (Stem)
Method P 95%L 95%U S 95%L 95%U P 95%L 95%U S 95%L 95%U P 95%L 95%U S 95%L 95%U

BLEU1 0.86 0.83 0.89 0.80 0.71 0.90 0.87 0.84 0.90 0.76 0.67 0.89 0.91 0.89 0.93 0.85 0.76 0.95
BLEU4 0.77 0.72 0.81 0.77 0.71 0.89 0.79 0.75 0.82 0.67 0.55 0.83 0.82 0.78 0.85 0.76 0.67 0.89
BLEU12 0.66 0.60 0.72 0.53 0.44 0.65 0.72 0.57 0.81 0.65 0.25 0.88 0.72 0.58 0.81 0.66 0.28 0.88

NIST 0.89 0.86 0.92 0.78 0.71 0.89 0.87 0.85 0.90 0.80 0.74 0.92 0.90 0.88 0.93 0.88 0.83 0.97
WER 0.47 0.41 0.53 0.56 0.45 0.74 0.43 0.37 0.49 0.66 0.60 0.82 0.48 0.42 0.54 0.66 0.60 0.81
PER 0.67 0.62 0.72 0.56 0.48 0.75 0.63 0.58 0.68 0.67 0.60 0.83 0.72 0.68 0.76 0.69 0.62 0.86
ROUGE-L 0.87 0.84 0.90 0.84 0.79 0.93 0.89 0.86 0.92 0.84 0.71 0.94 0.92 0.90 0.94 0.87 0.76 0.95
ROUGE-W 0.84 0.81 0.87 0.83 0.74 0.90 0.85 0.82 0.88 0.77 0.67 0.90 0.89 0.86 0.91 0.86 0.76 0.95
ROUGE-S∗ 0.85 0.81 0.88 0.83 0.76 0.90 0.90 0.88 0.93 0.82 0.70 0.92 0.95 0.93 0.97 0.85 0.76 0.94
ROUGE-S0 0.82 0.78 0.85 0.82 0.71 0.90 0.84 0.81 0.87 0.76 0.67 0.90 0.87 0.84 0.90 0.82 0.68 0.90
ROUGE-S4 0.82 0.78 0.85 0.84 0.79 0.93 0.87 0.85 0.90 0.83 0.71 0.90 0.92 0.90 0.94 0.84 0.74 0.93
ROUGE-S9 0.84 0.80 0.87 0.84 0.79 0.92 0.89 0.86 0.92 0.84 0.76 0.93 0.94 0.92 0.96 0.84 0.76 0.94
GTM10 0.82 0.79 0.85 0.79 0.74 0.83 0.91 0.89 0.94 0.84 0.79 0.93 0.94 0.92 0.96 0.84 0.79 0.92
GTM20 0.77 0.73 0.81 0.76 0.69 0.88 0.79 0.76 0.83 0.70 0.55 0.83 0.83 0.79 0.86 0.80 0.67 0.90
GTM30 0.74 0.70 0.78 0.73 0.60 0.86 0.74 0.70 0.78 0.63 0.52 0.79 0.77 0.73 0.81 0.64 0.52 0.80
Table 27  Comparison of the fluency score, Lin and Och (2004). In some cases, the ROUGE methods outperform the prominent BLEU and NIST methods
Fluency With Case Information (Case) Lower Case (NoCase) Lower Case & Stemmed (Stem)
Method P 95%L 95%U S 95%L 95%U P 95%L 95%U S 95%L 95%U P 95%L 95%U S 95%L 95%U

BLEU1 0.81 0.75 0.86 0.76 0.62 0.90 0.73 0.67 0.79 0.70 0.62 0.81 0.70 0.63 0.77 0.79 0.67 0.90
BLEU4 0.86 0.81 0.90 0.74 0.62 0.86 0.83 0.78 0.88 0.68 0.60 0.81 0.83 0.78 0.88 0.70 0.62 0.81
BLEU12 0.87 0.76 0.93 0.66 0.33 0.79 0.93 0.81 0.97 0.78 0.44 0.94 0.93 0.84 0.97 0.80 0.49 0.94
NIST 0.81 0.75 0.87 0.74 0.62 0.86 0.70 0.64 0.77 0.68 0.60 0.79 0.68 0.61 0.75 0.77 0.67 0.88
WER 0.69 0.62 0.75 0.68 0.57 0.85 0.59 0.51 0.66 0.70 0.57 0.82 0.60 0.52 0.68 0.69 0.57 0.81
PER 0.79 0.74 0.85 0.67 0.57 0.82 0.68 0.60 0.73 0.69 0.60 0.81 0.70 0.63 0.76 0.65 0.57 0.79
ROUGE-L 0.83 0.77 0.88 0.80 0.67 0.90 0.76 0.69 0.82 0.79 0.64 0.90 0.73 0.66 0.80 0.78 0.67 0.90
ROUGE-W 0.85 0.80 0.90 0.79 0.63 0.90 0.78 0.73 0.84 0.72 0.62 0.83 0.77 0.71 0.83 0.78 0.67 0.90
ROUGE-S∗ 0.84 0.78 0.89 0.79 0.62 0.90 0.80 0.74 0.86 0.77 0.64 0.90 0.78 0.71 0.84 0.79 0.69 0.90
ROUGE-S0 0.87 0.81 0.91 0.78 0.62 0.90 0.83 0.78 0.88 0.71 0.62 0.82 0.82 0.77 0.88 0.76 0.62 0.90
ROUGE-S4 0.84 0.79 0.89 0.80 0.67 0.90 0.82 0.77 0.87 0.78 0.64 0.90 0.81 0.75 0.86 0.79 0.67 0.90
ROUGE-S9 0.84 0.79 0.89 0.80 0.67 0.90 0.81 0.76 0.87 0.79 0.69 0.90 0.79 0.73 0.85 0.79 0.69 0.90
GTM10 0.73 0.66 0.79 0.76 0.60 0.87 0.71 0.64 0.78 0.80 0.67 0.90 0.66 0.58 0.74 0.80 0.64 0.90
GTM20 0.86 0.81 0.90 0.80 0.67 0.90 0.83 0.77 0.88 0.69 0.62 0.81 0.83 0.77 0.87 0.74 0.62 0.89
GTM30 0.87 0.81 0.91 0.79 0.67 0.90 0.83 0.77 0.87 0.73 0.62 0.83 0.83 0.77 0.88 0.71 0.60 0.83

METEOR, an F-score-based metric, takes the following factors into account and asserts
itself as the better metric, with strong experimental support:

– Use of recall: Not only precision but also recall (frequency) is taken into the
  analysis, which helps analyze translations and better reflects the correlation with
  human judgment.
– Explicit word-to-word matching and word order: The measurement process explicitly
  takes word-to-word matching and word order into account. Consequently,
  grammaticality contributes to the metric, which helps reflect
  translation quality in the correlation.
– Correlation at the sentence level: METEOR correlates at the sentence level and at
  the segment level, whereas BLEU correlates at the corpus level. Importantly,
  the sentence-level analysis can be more indicative: if any of the n-gram components
  is zero, the geometric average is zero, and the BLEU scores at the sentence level become
  meaningless.

As stated earlier, METEOR correlates at the sentence level, so the evaluation first cre-
ates an alignment between the candidate sentence and the reference translation sen-
tence. The alignment is a mapping between the unigrams of the two sentences. Notably, the
unigram precision P is calculated as in BLEU/NIST, as P = m / w_t, where m is the
number of unigrams in the candidate translation that are also found in the reference
translation, and w_t is the number of unigrams in the candidate translation. Similarly, the
unigram recall is R = m / w_r, where m is as in the precision and w_r is the number of uni-
grams in the reference translation. Precision and recall are combined using a har-
monic mean weighted heavily toward recall, as shown:

F_mean = 10PR / (R + 9P)    (1)
Besides, the brevity penalty, P_b, is defined as:

P_b = 0.5 × (number of chunks / number of unigrams matched)^3    (2)

The final score (the METEOR score) for a given segment/alignment is defined as:

M = F_mean (1 − P_b)    (3)

The brevity penalty has the effect of reducing F_mean by up to 50% if there are no bigram
or longer matches.
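Equations (1)-(3) can be combined into a small Python function. The unigram alignment and chunk identification are taken as given inputs here; this is an illustrative sketch of the scoring formulas, not the full METEOR system:

```python
def meteor_score(m, w_t, w_r, chunks):
    """Score from Eqs. (1)-(3): m matched unigrams, w_t candidate unigrams,
    w_r reference unigrams, chunks contiguous matched chunks in the alignment."""
    if m == 0:
        return 0.0
    P, R = m / w_t, m / w_r                  # unigram precision and recall
    f_mean = 10 * P * R / (R + 9 * P)        # Eq. (1): recall-weighted harmonic mean
    penalty = 0.5 * (chunks / m) ** 3        # Eq. (2): fragmentation penalty
    return f_mean * (1 - penalty)            # Eq. (3)

# A perfect six-word match forming a single chunk is penalized only slightly:
print(meteor_score(m=6, w_t=6, w_r=6, chunks=1))  # ≈ 0.9977
```

Note that a fully fragmented alignment (as many chunks as matched unigrams) hits the maximum 50% penalty, as described above.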

In Table 25, we show a comparative analysis of METEOR with other prominent
metrics, such as BLEU and NIST. It shows that the correlation (at the corpus level) with
human judgment can reach 0.964, whereas for BLEU on the same corpus it
is 0.817. Hence, METEOR outperforms BLEU in this regard. We observe that taking
recall into account helps improve the correlation over the BLEU/NIST algorithms.

8.2.4 The ROUGE metric

ROUGE, a set of metrics (also a software package), proposed by Lin (2004); Lin and
Hovy (2003); Lin and Och (2004), are used for the evaluation of MT and text summary.

Table 28  A comparative (pros and cons) analysis of automatic machine translation evaluation metrics

BLEU
  Pros: - The most commonly used metric, owing to its simplicity.
        - Maintains high reliability compared with human evaluation.
  Cons: - Only takes precision into account; the informativeness of N-gram words is not considered.
        - Recall (frequency) is not considered, although it would help evaluate translations better.
        - Stemming and synonymy matching would help define a better metric, but are not considered.
        - Assigns equal weight (coverage) to all N-grams, which may cause poor evaluation.

NIST
  Pros: - Similar to BLEU; the added feature of informativeness helps boost the metric formulation.
        - Gives more importance to the less frequent N-grams, which helps optimize evaluation.
        - Maintains high reliability compared with human evaluation.
  Cons: - Like BLEU, only takes precision into account.
        - Recall is not considered.
        - Stemming and synonymy matching are not taken into consideration.

METEOR
  Pros: - An F-score-based metric integrating both precision and recall; reflects better evaluation.
        - Flexible matching with stemming and synonyms.
  Cons: - Simple but naive; further enhancement is required.
        - Only unigrams, not higher-order N-grams, are taken into the analysis.

ROUGE
  Pros: - Can work with a single reference or multiple references.
        - Can achieve comparable performance.
        - More familiar for text summarization evaluation.
  Cons: - Flexible with string-to-string matching only; synonymy and paraphrase matching are not
          taken into the analysis.

The metrics perform a comparative analysis of automatically generated translation or


summary against a reference (or a set of references). Under the umbrella of ROUGE, we
have evaluation metrics such as ROUGE-N (ROUGE-1 and ROUGE-2), ROUGE-L, ROUGE-W, ROUGE-S, ROUGE-SU, and ROUGE-U. In the context of MT evaluation, three metrics, ROUGE-L, ROUGE-W, and ROUGE-S, are grouped under two methods. The first method (ROUGE-L and ROUGE-W) identifies the Longest Common Subsequence (LCS) between a candidate and a reference; hence the name ROUGE-L. Notably, ROUGE-L offers several useful features: it can work with a single reference or with multiple references and achieves comparable performance, as shown in Tables 26 and 27. However, plain LCS matching cannot differentiate LCSes with different spatial relations within their embedding sequences. To address this, ROUGE-W, the Weighted Longest Common Subsequence (WLCS), comes into play: it prefers consecutive LCS matches (Lin and Och 2004). The second method (ROUGE-S) performs skip-bigram matching (any pair of words taken in sentence order) instead of strict n-gram matching, based on co-occurrence statistics (Lin and Och 2004). Note that ROUGE-S has a set of variants that differ in the skip distance limit, dskip; for example, ROUGE-S0, ROUGE-S4, and ROUGE-S9 have skip distance limits of 0, 4, and 9, respectively, while ROUGE-S∗ denotes no (i.e., an unlimited) skip distance limit.
We observe that all the ROUGE methods correlate well with human judgments in
terms of adequacy and fluency (Lin and Och 2004). In Tables 26 and 27, we show a comparative analysis of the ROUGE metrics against other prominent MT metrics: Table 26 compares the adequacy scores and Table 27 the fluency scores. We observe that in some cases, the ROUGE methods outperform the prominent BLEU and NIST metrics.
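Such metric-human agreement is commonly quantified with the Pearson correlation coefficient (Benesty et al. 2009). Below is a small self-contained sketch; the per-system metric scores and human adequacy judgments are hypothetical, for illustration only:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five MT systems: a metric that ranks systems
# the way humans do yields a coefficient close to +1.
metric_scores = [0.31, 0.42, 0.27, 0.55, 0.48]
human_adequacy = [3.1, 3.9, 2.8, 4.6, 4.2]
print(pearson(metric_scores, human_adequacy) > 0.95)  # True
```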

8.2.5 Discussion

We have seen the limitations of the BLEU metric, and that most of them are addressed by other metrics, such as NIST, METEOR, and ROUGE. These metrics improve automatic MT evaluation in several respects; nevertheless, BLEU remains the basic metric most widely used by the community. Above all, to truly assess the quality of an MT system, we should not trust a single method. Rather, as a best practice, we should use the metrics suited to the specific use case.
For a quick overview of the pros and cons of the MT evaluation metrics and their comparative perspective, we present a summary of the illustrated metrics in Table 28.

9 Conclusion and future work

MT has become an urgent need for people in the 21st century, who encounter different languages in their daily lives. In this paper, we conduct a comprehensive study of MT covering its two major schemes: SMT and NMT. We first introduce SMT with its word-based, syntax-based, and phrase-based models, and the decoding process, which illustrates how
a translation comes into being through an SMT system. Next, we introduce the NMT’s
encoder-decoder architecture (i.e., sequence-to-sequence) with RNN and its upgraded
version: the attention mechanism. We then demonstrate the NMT’s Transformer model and
observe that Transformer with self-attention and GNMT with attention optimize translation quality by a large margin and outperform all the prominent models. Finally, we introduce both common human and automatic MT evaluation methods. Under the automatic
evaluation method, BLEU has been the most commonly used. However, to have a valid
evaluation, it is suggested to adopt a specific method related to the specific use cases or to
adopt a viable combination of methods.
In summary, even though MT has brought us massive convenience in many aspects, it still has defects that developers and engineers need to resolve. MT is expected to improve further over time through novel AI-based approaches.
Acknowledgements The authors would like to thank the anonymous reviewers for their quality reviews
and suggestions. This work was supported in part by The Science and Technology Development Fund of
Macao, Macao SAR, China under Grant 0033/2022/ITP and in part by The Faculty Research Grant Projects
of Macau University of Science and Technology, Macao SAR, China under Grant FRG-22-020-FI.

References
ACL (2022) ACL 2014 NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION. Available
at https://​www.​statmt.​org/​wmt14/​trans​lation-​task.​html. Accessed 05 Apr 2022
ACL (2022) ACL 2015 TENTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION. https://www.statmt.org/wmt15/translation-task.html. Accessed 05 Apr 2022
Abid A, Farooqi M, Zou J (2021) Persistent anti-muslim bias in large language models. In: Proceedings of
the 2021 AAAI/ACM conference on AI, Ethics, and Society, pp 298–306
DARPA (2003) Translingual Information Detection, Extraction and Summarization (TIDES) Program. http://www.darpa.mil/ipto/programs/tides/. Accessed 08 Apr 2022
Ahmadnia B, Dorr BJ (2019) Augmenting neural machine translation through round-trip training approach.
Open Comput Sci 9(1):268–278
Ahmadnia B, Serrano J, Haffari G (2017) Persian-Spanish low-resource statistical machine translation
through English as pivot language. Proceedings of the International Conference Recent Advances
in Natural Language Processing, RANLP 2017:24–30
Ahmadnia B, Haffari G, Serrano J (2019) Round-trip training approach for bilingually low-resource sta-
tistical machine translation systems. Int J Artif Intell 17(1):167–185
Ahmadnia B, Dorr BJ, Kordjamshidi P (2020) Knowledge graphs effectiveness in neural machine trans-
lation improvement. Comput Sci 21:299–318
Ahmed A, Hanneman G (2005) Syntax-based statistical machine translation: a review. Comput Linguist
1:1
Rocha Á, Adeli H, Reis LP, Costanzo S (2018) Trends and advances in information systems and technologies, vol 2. Springer, Berlin
Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z (2007) Dbpedia: a nucleus for a web of
open data. In: The semantic web, Springer, pp 722–735
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and trans-
late. arXiv preprint arXiv:​1409.​0473
Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation
with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation
measures for machine translation and/or summarization, pp 65–72
Bapna A, Firat O, Wang P, Macherey W, Cheng Y, Cao Y (2022) Multilingual Mix: Example Interpola-
tion Improves Multilingual Neural Machine Translation. In: ACL 2022
Barzilay R, Koehn P (2004) Natural Language Processing, Fall 2004, Machine Translation I, Lecture 20. CS and AI Lab, MIT, Cambridge, MA
Bender EM (2019) A typology of ethical risks in language technology with an eye towards where trans-
parent documentation can help. In: Future of artificial intelligence: language, ethics, technology
workshop
Benesty J, Chen J, Huang Y, Cohen I (2009) Pearson correlation coefficient. In: Noise reduction in
speech processing, Springer, pp 1–4
Bentivogli L, Bisazza A, Cettolo M, Federico M (2016) Neural versus phrase-based machine translation
quality: a case study. arXiv preprint arXiv:​1608.​04631
Besacier L, Blanchon H (2017) Comparing statistical machine translation and neural machine translation
performances. https://​evalu​erlata.​hypot​heses.​org/​files/​2017/​07/​Laure​nt-​Besac​ier-​NMTvs​SMT.​pdf,
laboratoire LIG, Université Grenoble Alpes, France
Bick E (2007) Dan2eng: wide-coverage Danish-English machine translation. In: Proceedings of Machine
Translation Summit XI: Papers
Bollacker K, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph
database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD interna-
tional conference on management of data, pp 1247–1250
Bouchard G (2007) Efficient bounds for the softmax function and applications to approximate inference
in hybrid models. In: NIPS 2007 workshop for approximate Bayesian inference in continuous/
hybrid systems
Brown PF, Pietra VJD, Pietra SAD, Mercer RL (1993) The mathematics of statistical machine transla-
tion: parameter estimation. Comput Linguist 19(2):263–311
Buck C, Heafield K, Van Ooyen B (2014) N-gram counts and language models from the common crawl.
In: Proceedings of the Ninth International Conference on Language Resources and Evaluation
(LREC’14), pp 3579–3584
Canfora C, Ottmann A (2020) Risks in neural machine translation. Trans Spaces 9(1):58–77
Casas N, Costa-jussà MR, Fonollosa JA, Alonso JA, Fanlo R (2021) Linguistic knowledge-based vocab-
ularies for Neural Machine Translation. Nat Lang Eng 27(4):485–506
Caswell I, Liang B (2022) Recent Advances in Google Translate. Tutorial https://​ai.​googl​eblog.​com/​
2020/​06/​recent-​advan​ces-​in-​google-​trans​late.​html. Accessed 05 Apr 2022
Chah N (2018) OK Google, what is your ontology? Or: exploring freebase classification to understand
Google’s Knowledge Graph. arXiv preprint arXiv:​1805.​03885
Chapelle O, Weston J, Bottou L, Vapnik V (2000) Vicinal risk minimization. Adv Neural Inf Process
Syst 13:1
Cheng Y (2019) Joint training for pivot-based neural machine translation. In: Joint Training for Neural
Machine Translation, Springer, pp 41–54
Cheng Y, Jiang L, Macherey W (2019) Robust Neural Machine Translation with Doubly Adversarial
Inputs. In: ACL
Cheng Y, Jiang L, Macherey W, Eisenstein J (2020) AdvAug: Robust Adversarial Augmentation for
Neural Machine Translation. In: ACL, https://​arxiv.​org/​abs/​2006.​11834
Chiang D, Knight K (2006) An introduction to synchronous grammars. Tutorial https://​www3.​nd.​edu/​
~dchia​ng/​papers/​synch​tut.​pdf. Accessed 4 Jan 2022
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014a) Learning
phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint
arXiv:​1406.​1078
Cho K, Van Merriënboer B, Bahdanau D, Bengio Y (2014b) On the properties of neural machine transla-
tion: encoder-decoder approaches. arXiv preprint arXiv:​1409.​1259
Colah (2015) Understanding LSTM networks. http://​colah.​github.​io/​posts/​2015-​08-​Under​stand​ing-​LSTMs/.
Accessed 08 Apr 2022
Common Crawl corpus (2022) https://​commo​ncrawl.​org/. Accessed 30 Sep 2022
Corbí-Bellot AM, Forcada ML, Ortiz-Rojas S, Pérez-Ortiz JA, Ramírez-Sánchez G, Sánchez-Martínez F,
Alegria I, Mayor A, Sarasola K (2005) An open-source shallow-transfer machine translation engine
for the Romance languages of Spain. In: Proceedings of the 10th EAMT conference: practical appli-
cations of machine translation
Costa-Jussa MR, Fonollosa JA (2015) Latest trends in hybrid machine translation and its applications. Com-
put Speech Lang 32(1):3–10
National Research Council (1966) Language and machines: computers in translation and linguistics; a report. Publication 1416, National Research Council. https://books.google.com/books?hl=zh-CN&lr=&id=Q0ErAAAAYAAJ&oi=fnd&pg=PA1&dq=Languages+and+machines:+computers+in+translation+and+linguistics&ots=NgytafcXa-&sig=Hc733OYAAT89yd4U-3xLdh77gEM#v=onepage&q&f=false. Accessed 08 Apr 2022
Cui Y, Chen Z, Wei S, Wang S, Liu T, Hu G (2016) Attention-over-attention neural networks for reading
comprehension. arXiv preprint arXiv:​1607.​04423
Currey A, Heafield K (2019) Zero-resource neural machine translation with monolingual pivot data. In: Pro-
ceedings of the 3rd workshop on neural generation and translation, pp 99–107
Dabre R, Cromieres F, Kurohashi S, Bhattacharyya P (2015) Leveraging small multilingual corpora for smt
using many pivot languages. In: Proceedings of the 2015 conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, pp 1192–1202
Dabre R, Imankulova A, Kaneko M, Chakrabarty A (2021) Simultaneous multi-pivot neural machine trans-
lation. arXiv preprint arXiv:​2104.​07410
Dajun Z, Yun W (2015) Corpus-based machine translation: Its current development and perspectives. In:
International Forum of Teaching and Studies, American Scholars Press, Inc., vol 11, p 90
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algo-
rithm. J R Stat Soc B 39(1):1–22
Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:​1810.​04805
Doddington G (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence
statistics. In: Proceedings of the second international conference on Human Language Technology
Research, Morgan Kaufmann Publishers Inc., pp 138–145
Edunov S, Ott M, Auli M, Grangier D (2018) Understanding back-translation at scale. arXiv preprint arXiv:​
1808.​09381
Esplà-Gomis M, Forcada ML, Ramírez-Sánchez G, Hoang H (2019) ParaCrawl: Web-scale parallel corpora
for the languages of the EU. In: Proceedings of Machine Translation Summit XVII: Translator, Pro-
ject and User Tracks, pp 118–119
Europarl (2022) Europarl: European Parliament Proceedings Parallel Corpus. https://​www.​statmt.​org/​europ​
arl/. Accessed 30 Sep 2022
Forcada ML, Ginestí-Rosell M, Nordfalk J, O’Regan J, Ortiz-Rojas S, Pérez-Ortiz JA, Sánchez-Martínez F,
Ramírez-Sánchez G, Tyers FM (2011) Apertium: a free/open-source platform for rule-based machine
translation. Mach Trans 25(2):127–144
Freitag M, Torres DV, Grangier D, Cherry C, Foster G (2022) A Natural Diet: Towards Improving Natural-
ness of Machine Translation Output. In: Proceedings of the 60th Annual Meeting of the Association
for Computational Linguistics, Online
Furuse O, Iida H (1992) An example-based method for transfer-driven machine translation. TMI
1992:139–150
Färber M, Ell B, Menne C, Rettinger A (2015) A comparative survey of dbpedia, freebase, opencyc,
wikidata, and yago. Semantic Web J 1(1):1–5
Gehring J, Auli M, Grangier D, Yarats D, Dauphin YN (2017) Convolutional sequence to sequence learn-
ing. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR.
org, pp 1243–1252
Gheini M, Ren X, May J (2021a) Cross-attention is all you need: adapting pretrained transformers for
machine translation. In: Proceedings of the 2021 conference on Empirical Methods in Natural
Language Processing, pp 1754–1765
Gheini M, Ren X, May J (2021b) On the strengths of cross-attention in pretrained transformers for
machine translation
Gispert Ramis A (2007) Introducing linguistic knowledge into statistical machine translation. Universi-
tat Politècnica de Catalunya
Graves A (2012) Long short-term memory. In: Supervised sequence labelling with recurrent neural net-
works. Springer, Berlin. pp 37–45
Guo Z, Huang Z, Zhu KQ, Chen G, Zhang K, Chen B, Huang F (2021) Automatically paraphrasing via
sentence reconstruction and round-trip translation. IJCAI
Habash N, Zalmout N, Taji D, Hoang H, Alzate M (2017) A Parallel Corpus for Evaluating Machine
Translation between Arabic and European Languages. In: Proceedings of the 15th conference of
the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers,
Association for Computational Linguistics, Valencia, Spain, pp 235–241. https://​aclan​tholo​gy.​org/​
E17-​2038
Hardmeier C (2012) Discourse in statistical machine translation. a survey and a case study. Discours
Revue de linguistique, psycholinguistique et informatique A journal of linguistics, psycholinguis-
tics and computational linguistics (11)
He D, Xia Y, Qin T, Wang L, Yu N, Liu TY, Ma WY (2016) Dual learning for machine translation. Adv
Neural Inf Process Syst 29
Hecht-Nielsen R (1992) Theory of the backpropagation neural network. In: Neural networks for percep-
tion, Elsevier, pp 65–93
Hill DC, Gombay C, Sanchez O, Woappi B, Romero Vélez AS, Davidson S, Richardson EZ (2022) Lost
in machine translation: The promises and pitfalls of machine translation for multilingual group
work in global health education. Discov Educ 1(1):1–5
Horváth I (2022) AI in interpreting: ethical considerations. Across Lang Cult 23(1):1–13
Huck M, Birch A (2015) The Edinburgh machine translation systems for IWSLT 2015. In: Proceedings
of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign
Hull DA, Grefenstette G (1996) Querying across languages: a dictionary-based approach to multilingual
information retrieval. In: Proceedings of the 19th annual international ACM SIGIR conference on
Research and development in information retrieval, ACM, pp 49–57
IWSLT (2018) International Workshop on Spoken Language Translation. https://​works​hop20​18.​iwslt.​
org/. Accessed 08 Apr 2022
Imankulova A, Sato T, Komachi M (2019) Filtered pseudo-parallel corpus improves low-resource neural
machine translation. ACM Trans Asian Low-Resour Lang Inf Process 19(2):1–16
Islam M, Anik M, Hoque S, Islam A et al (2021) Towards achieving a delicate blending between rule-
based translator and neural machine translator. Neural Comput Appl 33(18):12141–12167
Jean S, Cho K, Memisevic R, Bengio Y (2015) On Using Very Large Target Vocabulary for Neural
Machine Translation. In: Proceedings of the 53rd annual meeting of the Association for Compu-
tational Linguistics and the 7th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pp 1–10
Jehl L, Simianer P, Hirschler J, Riezler S (2015) The Heidelberg university English-German translation
system for IWSLT 2015. In: Proceedings of the 12th International Workshop on Spoken Language
Translation: Evaluation Campaign
Jian Zhang MZ Ji Wu (2003) The improvement of automatic machine translation evaluation. J Chin Inf
Process 17(6):2, http://​jcip.​cipsc.​org.​cn/​CN/​abstr​act/​artic​le_​1823.​shtml
Johnson M (2020) A scalable approach to reducing gender bias in Google translate, https://​ai.​googl​
eblog.​com/​2020/​04/a-​scala​ble-​appro​ach-​to-​reduc​ing-​gender.​html. Google AI Blog Accessed on
30 Sep 2022
Jurafsky D, Martin JH (2022) Speech and language processing: an introduction to natural language pro-
cessing, computational linguistics, and speech recognition, chapter 10: machine translation and
encoder-decoder models, 10.9 Bias and Ethical Issues (3rd (draft) ed.). Accessed on 08 Apr 2022
Kalchbrenner N, Espeholt L, Simonyan K, Oord Avd, Graves A, Kavukcuoglu K (2016) Neural machine
translation in linear time. arXiv preprint arXiv:​1610.​10099
Kaljahi RSZ, Rubino R, Roturier J, Foster J, Park BB (2012) A detailed analysis of phrase-based and
syntax-based machine translation: The search for systematic differences. In: Proceedings of
AMTA
Karim R (2019) Illustrated attention. https://​towar​dsdat​ascie​nce.​com/​attn-​illus​trated-​atten​tion-​5ec4a​d276e​
e3. Accessed 08 Apr 2022
Kenny D, Moorkens J, Do Carmo F (2020) Fair MT: towards ethical, sustainable machine translation. Trans
Spaces 9(1):1–11
Kharitonova K (2021) Linguistics4fairness: neutralizing Gender Bias in neural machine translation by intro-
ducing linguistic knowledge. Master’s thesis, Universitat Politècnica de Catalunya
Klakow D, Peters J (2002) Testing the correlation of word error rate and perplexity. Speech Commun
38(1–2):19–28
Klein G, Kim Y, Deng Y, Senellart J, Rush AM (2017) Opennmt: Open-source toolkit for neural machine
translation. arXiv preprint arXiv:​1701.​02810
Ko WJ, El-Kishky A, Renduchintala A, Chaudhary V, Goyal N, Guzmán F, Fung P, Koehn P, Diab M
(2021) Adapting high-resource NMT models to translate low-resource related languages without par-
allel data. arXiv preprint arXiv:​2105.​15071
Koehn P (2009) Statistical machine translation. Cambridge University Press
Koehn P, Chiang D (2019) Special interest group of machine translation. http://​www.​sigmt.​org. Accessed
08 Apr 2022
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens
R et al (2007) Moses: Open source toolkit for statistical machine translation. In: Proceedings of the
45th annual meeting of the association for computational linguistics companion volume proceedings
of the demo and poster sessions, pp 177–180
Koehn P (2009) Chapter 4 Word-based models - Statistical Machine Translation. Cambridge University
Press, Cambridge. Accessed 08 Apr 2022
Koehn P (2009) Statistical Machine Translation Lecture 5 Syntax-Based Models. Cambridge University
Press, Cambridge. Accessed 08 Apr 2022
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of the 2003 Confer-
ence of the North American Chapter of the Association for Computational Linguistics on Human
Language Technology-Volume 1, Association for Computational Linguistics, pp 48–54
Koehn P (2005) Europarl: A parallel corpus for statistical machine translation. In: Proceedings of machine
translation summit x: papers, pp 79–86
Koehn P (2009) Statistical Machine Translation Lecture 6 Decoding. Cambridge University Press, Cam-
bridge. Accessed 08 Apr 2022
Kussul E, Baidyk T, Kasatkina L, Lukovich V (2001) Rosenblatt perceptrons for handwritten digit rec-
ognition. In: IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No.
01CH37222), IEEE, vol 2, pp 1516–1520
Labaka G, España-Bonet C, Màrquez L, Sarasola K (2014) A hybrid machine translation architecture guided
by syntax. Mach Trans 28(2):91–125
Lample G, Conneau A (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:​1901.​07291
Lavie A, Sagae K, Jayaraman S (2004) The significance of recall in automatic metrics for MT evaluation. In:
Conference of the Association for Machine Translation in the Americas, Springer, pp 134–143
Li Y, Xiong D, Zhang M (2018) A survey of neural machine translation. Chinese Journal of Computers
12:2734
Li Q, Zhang X, Xiong J, Hwu Wm, Chen D (2019) Implementing neural machine translation with bi-direc-
tional GRU and attention mechanism on FPGAs using HLS. In: Proceedings of the 24th Asia and
South Pacific Design Automation Conference, pp 693–698
Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches
out, pp 74–81
Lin CY, Hovy E (2003) Automatic evaluation of summaries using n-gram co-occurrence statistics. In: Pro-
ceedings of the 2003 human language technology conference of the North American chapter of the
association for computational linguistics, pp 150–157
Lin CY, Och FJ (2004) Automatic evaluation of machine translation quality using longest common subse-
quence and skip-bigram statistics. In: Proceedings of the 42nd Annual Meeting of the Association for
Computational Linguistics (ACL-04), pp 605–612
Liu J (2020) Comparing and analyzing cohesive devices of SMT and NMT from Chinese to English: a dia-
chronic approach. Open J Mod Linguist 10(6):765–772
Liu CH, Silva CC, Wang L, Way A (2018) Pivot machine translation using chinese as pivot language. In:
China workshop on machine translation. Springer, Berlin. pp 74–85
Liu X, Wang Y, Wang X, Xu H, Li C, Xin X (2021) Bi-directional gated recurrent unit neural network based
nonlinear equalizer for coherent optical communication system. Opt Express 29(4):5923–5933
Lu Y, Zhang J, Zong C (2018) Exploiting knowledge graph in neural machine translation. In: China work-
shop on machine translation. Springer, Berlin. pp 27–38
Luccioni A, Viviano J (2021) What’s in the box? an analysis of undesirable content in the Common Crawl
corpus. In: Proceedings of the 59th annual meeting of the Association for Computational Linguis-
tics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short
Papers), pp 182–189
Luong MT, Sutskever I, Le QV, Vinyals O, Zaremba W (2014) Addressing the rare word problem in
neural machine translation. arXiv preprint arXiv:​1410.​8206
Luong MT, Pham H, Manning CD (2015) Effective Approaches to Attention-based Neural Machine
Translation. In: Proceedings of the 2015 conference on Empirical Methods in Natural Language
Processing, pp 1412–1421
Ma S, Sun X, Wang Y, Lin J (2018) Bag-of-words as target for neural machine translation. arXiv pre-
print arXiv:​1805.​04871
Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of English: the
penn treebank. Comput Linguist 19(2):313–330
Mehandru N, Robertson S, Salehi N (2022) Reliable and Safe Use of Machine Translation in Medical
Settings. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transpar-
ency (Seoul, South Korea)(FAccT’22). Association for Computing Machinery, New York, NY,
USA
Meng F, Lu Z, Li H, Liu Q (2016) Interactive attention for neural machine translation. arXiv preprint
arXiv:​1610.​05011
Moon TK (1996) The expectation-maximization algorithm. IEEE Signal Process Mag 13(6):47–60
Moon J, Cho H, Park EL (2020) Revisiting round-trip translation for quality estimation. arXiv preprint
arXiv:​2004.​13937
Moorkens J (2022) Ethics and machine translation. Machine translation for everyone: empowering users
in the age of artificial intelligence 18:121
Moussallem D, Ngonga Ngomo AC, Buitelaar P, Arcan M (2019) Utilizing knowledge graphs for neural
machine translation augmentation. In: Proceedings of the 10th international conference on knowl-
edge capture, pp 139–146
Mueller V (2022) An introduction to synchronous grammars. Tutorial https://​medium.​com/​towar​ds-​data-​
scien​ce/​atten​tion-​please-​85bd0​abac41. Accessed 05 Apr 2022
Nießen S, Ney H (2000) Improving SMT quality with morpho-syntactic analysis. In: COLING 2000
Volume 2: the 18th international conference on computational linguistics
Nyberg EH, Mitamura T (1992) The KANT system: Fast, accurate, high-quality translation in practical
domains. In: Proceedings of the 14th conference on Computational linguistics-Volume 3, Associa-
tion for Computational Linguistics, pp 1069–1073
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Lin-
guist 29(1):19–51
Och FJ, Tillmann C, Ney H (1999) Improved alignment models for statistical machine translation. In:
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very
Large Corpora
Och FJ, Ueffing N, Ney H (2001) An efficient A* search algorithm for statistical machine translation. In:
Proceedings of the workshop on Data-driven methods in machine translation-Volume 14, Associa-
tion for Computational Linguistics, pp 1–8
Och FJ, Gildea D, Khudanpur S, Sarkar A, Yamada K, Fraser A, Kumar S, Shen L, Smith D, Eng K
et al (2003) Syntax for statistical machine translation. In: Johns Hopkins University 2003 Sum-
mer Workshop on Language Engineering, Center for Language and Speech Processing, Baltimore,
MD, Tech. Rep
NIST (National Institute of Standards and Technology) (2010) Open Machine Translation Evaluation. https://www.nist.gov/itl/iad/mig/open-machine-translation-evaluation. Accessed 08 Apr 2022
Ortega JE, Castro Mamani R, Cho K (2020) Neural machine translation with a polysynthetic low
resource language. Mach Trans 34(4):325–346
Pal SK, Mitra S (1992) Multilayer perceptron, fuzzy sets, and classification. IEEE Trans Neural Netw
3(5):683–697
Papadimitriou CH (2003) Computational complexity. Wiley, New York
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine
translation. In: Proceedings of the 40th annual meeting on association for computational linguis-
tics, Association for Computational Linguistics, pp 311–318
ParaCrawl (2022) ParaCrawl: Web-scale parallel corpora for the languages of the EU. https://paracrawl.eu/. Accessed 30 Sep 2022
Pierce J (1966) Language and machines: computers in translation and linguistics; a Report. National
Research Council
Popović M (2017) chrF++: words helping character n-grams. In: Proceedings of the second conference on
machine translation, pp 612–618
Prates MO, Avelar PH, Lamb LC (2020) Assessing gender bias in machine translation: a case study with
google translate. Neural Comput Appl 32(10):6363–6381
Pryzant R, Chung Y, Jurafsky D, Britz D (2018) JESC: Japanese-English Subtitle Corpus. In: Proceedings
of the eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020) Exploring the
limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21:1–67
Ramnath S, Johnson M, Gupta A, Raghuveer A (2021) HintedBT: Augmenting Back-Translation with Qual-
ity and Transliteration Hints. In: EMNLP 2021
Ranathunga S, Lee ESA, Skenduli MP, Shekhar R, Alam M, Kaur R (2021) Neural machine translation for
low-resource languages: a survey. arXiv preprint arXiv:​2106.​15115
Ravikumar D, Kodge S, Garg I, Roy K (2020) Exploring Vicinal Risk Minimization for Lightweight Out-of-
Distribution Detection. arXiv preprint arXiv:​2012.​08398
Rebele T, Suchanek F, Hoffart J, Biega J, Kuzey E, Weikum G (2016) YAGO: A multilingual knowledge
base from wikipedia, wordnet, and geonames. In: International semantic web conference, Springer, pp
177–185
Rescigno AA, Vanmassenhove E, Monti J, Way A (2020) A case study of natural gender phenomena in
translation. A comparison of Google Translate, Bing Microsoft Translator and DeepL for English to
Italian, French and Spanish. In: CLiC-it
Richards C, Bouman WP, Seal L, Barker MJ, Nieder TO, T’Sjoen G (2016) Non-binary or genderqueer gen-
ders. Int Rev Psychiatry 28(1):95–102
Ringler D, Paulheim H (2017) One knowledge graph to rule them all? Analyzing the differences between
DBpedia, YAGO, Wikidata & co. In: Joint German/Austrian Conference on Artificial Intelligence
(Künstliche Intelligenz), Springer, pp 366–372
Rocktäschel T, Grefenstette E, Hermann KM, Kočiskỳ T, Blunsom P (2015) Reasoning about entailment
with neural attention. arXiv preprint arXiv:​1509.​06664
Rothman D (2021) Transformers for Natural Language Processing: build innovative deep neural network
architectures for NLP with Python, PyTorch, TensorFlow, BERT, RoBERTa, and more. Packt Pub-
lishing Ltd
Sakamoto A (2019) Unintended consequences of translation technologies: from project managers’ perspec-
tives. Perspectives 27(1):58–73
Sara Stymne GT (2022) Phrase Based Machine Translation. Tutorial https://​cl.​lingf​i l.​uu.​se/​kurs/​MT19/​
slides/​pbsmt.​pdf. Accessed 05 Apr 2022
Saunders D, Sallis R, Byrne B (2020) Neural machine translation doesn’t translate gender coreference right
unless you make it. In: Proceedings of the second workshop on gender bias in natural language pro-
cessing, pp 35–43
Savoldi B, Gaido M, Bentivogli L, Negri M, Turchi M (2021) Gender bias in machine translation. Trans
Assoc Comput Linguist 9:845–874
Schiebinger L (2014) Scientific research must take gender into account. Nature 507(7490):9
Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process
45(11):2673–2681
Schwenk H, Chaudhary V, Sun S, Gong H, Guzmán F (2019) Wikimatrix: Mining 135m parallel sentences
in 1620 language pairs from wikipedia. arXiv preprint arXiv:​1907.​05791
Sennrich R, Haddow B, Birch A (2015) Improving neural machine translation models with monolingual
data. arXiv preprint arXiv:​1511.​06709
Sennrich R, Haddow B, Birch A (2015) Neural machine translation of rare words with subword units. arXiv
preprint arXiv:​1508.​07909
Shannon CE, Weaver W (1949) The mathematical theory of communication. University of Illinois Press
Sharma S, Sharma S, Athaiya A (2017) Activation functions in neural networks. Towards Data Science 6(12):310–316
Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, Dean J (2017) Outrageously large neural
networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:​1701.​06538
Shieber SM, Schabes Y (1990) Synchronous tree-adjoining grammars. In: Proceedings of the 13th confer-
ence on Computational linguistics-Volume 3, pp 253–258
Shieber SM, Schabes Y (1991) Generation and synchronous tree-adjoining grammars. Comput Intell
7(4):220–228
Singh SP, Kumar A, Darbari H, Singh L, Rastogi A, Jain S (2017) Machine translation using deep learn-
ing: An overview. In: 2017 international conference on computer, communications and electronics
(comptelix), IEEE, pp 162–167
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with tar-
geted human annotation. In: Proceedings of association for machine translation in the Americas,
vol 200
Somers H (2005) Round-trip translation: What is it good for? Proc Austral Lang Technol Workshop
2005:127–133
Stanovsky G, Smith NA, Zettlemoyer L (2019) Evaluating gender bias in machine translation. In: Pro-
ceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Asso-
ciation for Computational Linguistics, Florence, Italy, pp 1679–1684, https://​doi.​org/​10.​18653/​v1/​
P19-​1164, https://​aclan​tholo​gy.​org/​P19-​1164
Stasimioti M, Sosoni V, Kermanidis KL, Mouratidis D (2020) Machine Translation Quality: a compara-
tive evaluation of SMT, NMT and tailored-NMT outputs. In: Proceedings of the 22nd annual con-
ference of the European Association for Machine Translation, pp 441–450
Staudemeyer RC, Morris ER (2019) Understanding LSTM–a tutorial into long short-term memory
recurrent neural networks. arXiv preprint arXiv:​1909.​09586
Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: A
multilingual aligned parallel corpus with 20+ languages. arXiv preprint arXiv:​cs/​06090​58
Strubell E, Ganesh A, McCallum A (2019) Energy and Policy Considerations for Deep Learning in NLP.
In: Proceedings of the 57th annual meeting of the Association for Computational Linguistics, pp
3645–3650
Sun T, Gaut A, Tang S, Huang Y, ElSherief M, Zhao J, Mirza D, Belding E, Chang KW, Wang WY
(2019) Mitigating gender bias in natural language processing: literature review. In: Proceedings of
the 57th annual meeting of the Association for Computational Linguistics, pp 1630–1640
Sun T, Shah A, Webster K, Johnson M (eds) (2021) They, them, theirs: rewriting with gender-neutral
English, arXiv:​2102.​06788
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances
in neural information processing systems, pp 3104–3112
Tixier AJP (2018) Notes on deep learning for nlp. arXiv preprint arXiv:​1808.​09772