Journal of
Language and Translation
Volume 12, Number 4, 2022 (pp. 107-118)
ABSTRACT
The use of machine translation to communicate and access information has become increasingly
common. Various translation software and systems appear on the Internet to enable interlingual
communication. Accelerating translation and reducing its cost are other factors in the increasing
popularity of machine translation. Even if the quality of this type of translation is lower than that of human
translation, it is still significant in many ways. The MQM-DQF model provides standards of error
typology for objective and quantitative assessment of translation quality. In this model, two criteria
(accuracy and fluency) are used to assess machine translation quality. The MQM-DQF model was used
in this study to assess the quality of Google Translate performance in translating English and Persian
newspaper texts. Five texts from Persian newspapers and five texts from English newspapers were
randomly selected and translated by Google Translate both at the sentence level and the whole text. The
translated texts were assessed based on the MQM-DQF model. Translation errors were identified and
coded at three severity levels: critical, major, and minor errors. By counting the errors and scoring them,
the percentage of accuracy and fluency criteria in each of the translated texts was calculated. The results
showed that Google Translate performs better in translating texts from Persian into English;
furthermore, in sentence-level translation, it performs better than the translation of the whole text.
Moreover, translations of different newspaper texts (economic, cultural, sports, political, and scientific)
were not of the same quality.
Keywords: Accuracy; Fluency; Machine Translation (MT); MQM-DQF Model; Translation Quality
Assessment (TQA)
INTRODUCTION
In Translation Studies, the issue of translation quality has always been of great importance. In order to eliminate the effect of subjectivity on translation quality assessment (TQA), assessment should be done based on predefined criteria and models. "The main goal of Translation Quality Assessment (TQA) is to maintain, and deliver to the client, buyer, user, reader, etc., translated texts" (Doherty, 2017, p. 131).

* Corresponding Author's email: zahraforadi@birjand.ac.ir

Assessing the quality of translation is very complicated due to the subtleties of natural language. One of the issues that has always been considered in machine translation is the set of methods and parameters for assessing translation results. "The evaluation of machine translation (MT) systems is a vital field of research, both for determining the effectiveness of existing MT systems and for optimizing the performance of MT systems" (Dorr et al., 2011, p. 745). As the translation may be done by human or machine, translation assessment may also be done by either human or machine; to do so, a set of predefined criteria can be used as an instruction by a human assessor or be built into quality-assessment software that performs the assessment automatically. Each approach has advantages and disadvantages. Automatic assessment is low-cost and objective, but it does not show the types of errors in a text well. The most valid technique for judging the quality of a translation is manual assessment by human assessors, but its inevitable drawback is that it is time-consuming and costly (see Maučec & Donaj, 2019). Furthermore, human assessment depends heavily on the skills of the assessor. Ultimately, the performance quality of a machine assessor must also be assessed by a human assessor.

108 Assessing the Performance Quality of Google Translate…

The structure of a translation quality assessment model includes the classification of error types and their weighting based on severity levels. If translation errors are considered throughout the text in general and a single score is finally assigned to the whole text, the assessment is holistic; if the errors are examined in detail, the assessment is analytic. Until recent years, all assessment approaches were holistic (Karimnia, 2011) and evaluation criteria were mainly subjective (Saldanha & O'Brien, 2013).

According to Waddington (2001), the literature on TQA shows that almost all previous studies have been descriptive or theoretical. Waddington's model was introduced through empirical studies and has been used in numerous studies to assess the quality of translation (see, for example, Andishe Borujeni, 2020).

With the expansion of machine translation use and the increasing awareness of translation consumers, the need for more detailed translation evaluation models, which could be used for machine translation evaluation, gradually increased. Around 2011, TAUS (Translation Automation User Society) sought to create a new way of measuring translation quality to satisfy its customers. O'Brien undertook a project in collaboration with TAUS to explore the potential of a dynamic quality estimation model. The first part of this study was conducted in 2011 on 11 then-current quality evaluation models. It was the starting point for benchmarking translation quality evaluation models to achieve a comprehensive TQA model (O'Brien, 2012).

"Translation quality assessment (TQA) has suffered from a lack of standard methods. Starting in 2012, the Multidimensional Quality Metrics (MQM) and Dynamic Quality Framework (DQF) projects independently began to address the need for such shared methods" (Moorkens et al., 2018, p. 109). According to Lommel et al. (2018, p. 23), "the Multidimensional Quality Metrics (MQM) framework for describing and defining translation quality assessment metrics was developed in the EU-funded QTLaunchPad project".

Some studies in the field of TQA have been conducted based on the MQM model, including "The evaluation of the quality of Crowdsourcing Translations of Wikipedia Articles based on the MQM model" (Vahedi Kakhki, 2018), whose results showed the relatively high quality of the translation of those texts.

"Dynamic Quality Framework (DQF) is a comprehensive suite of tools for quality evaluation of both human and machine translation developed by Translation Automation User Society (TAUS)" (Lommel et al., 2018, p. 27). According to Moorkens et al. (2018, p. 109), "in 2014 these approaches were integrated, centering on a shared error typology (the 'DQF/MQM Error Typology') that brought them together."

In this way, standard benchmarks were established to assess the quality of translation. "This approach to quality evaluation provides a common vocabulary to describe and categorize translation errors and to create translation quality metrics that tie translation quality to specifications" (Moorkens et al., 2018, p. 109). From the perspective of this model, two quantified dependent variables, namely accuracy and fluency, influence the independent variable, i.e. quality.

The MQM-DQF model was the metric used in the present research to evaluate the quality of Google Translate's performance in translating English and Persian newspaper texts.
JLT 12(4) – 2022 109
Figure 1
A subset for MT analysis (Lommel and Melby, 2018, p. 30)
Grammar: Are the grammatical rules of the translated text followed correctly?
Unintelligibility: Does the translated text convey its original meaning clearly?

Error severity levels (weighting)
To evaluate a translation, it is not enough just to know the number of errors. "Evaluators also need to know how severe they are and how important the error type is for the task at hand" (Moorkens et al., 2018, p. 120). According to Lommel et al. (2018, p. 33), "severity level is an indicator of the importance of an issue with an accompanying numerical representation and weight is a numeric representation of the importance of an issue type in a specific metric".

A critical error is an error that the user does not notice. Critical errors per se make the translation unsuitable for its intended purpose, and a critical error can prevent the translation from achieving its purpose. For example, if in the translation of an industrial text a product weighing 2 pounds (approximately 0.9 kg) is rendered as weighing 2 kg, the result of this translation may be to the detriment of the user, who does not notice it; therefore, it is a critical error. The default score for each of these errors is 100.

Major errors make the meaning of the text unclear to the user. For example, if in translating an educational text about insects from Italian into English the word "ape", which means "bee" in Italian and "monkey" in English, is translated as "monkey", the error has negative consequences, but it does not seriously harm the user, because he or she is aware of it. The default score for this group is 10. If a holistic assessment finds more than one major error per thousand words in a translated text, the assessor is required to change the assessment method to an analytic one and find the exact cause of the errors.

Minor errors do not affect the usability of a translated text, and in many cases the user consciously ignores them. For example, in translating a text into English, the translator may write "to who it may concern" instead of "to whom it may concern", which the user ignores. Minor errors do not pose a problem in conveying meaning. The default score for this group is 1.

Null errors are errors that are ignored by default before text assessment begins. For example, in assessing the translation of a brochure containing the instructions for an electrical device, the style of writing may not be essential, so it will not be part of the translation criteria at all, and if such an error is seen in the text, it will be ignored. The score of these errors is always zero (Moorkens et al., 2018, pp. 116–122).

According to these definitions of severity levels, each previously detected error was labelled. Then the percentages of both the accuracy and fluency criteria were calculated using the following formula:

Score = 1 − (minor(1) + major(10) + critical(100)) / Word Count

where minor, major, and critical are the numbers of errors at each severity level, multiplied by their weights of 1, 10, and 100, respectively. The final score is numerically between zero and one.

The point here is to count the words in the translated texts correctly. Google Translate does not regard the zero-width non-joiner (ZWNJ) in translating from English into Persian; therefore, in translating the five texts from English into Persian, formal Persian editing was done to restore the ZWNJ and obtain the exact number of words in the translated texts.

By weighting the errors and calculating their scores, the rates of the accuracy and fluency criteria can be reported quantitatively (as percentages), using descriptive statistics (frequency and percentage calculation).

RESULTS
The quality of Google Translate's performance in translating English newspaper texts (economic, cultural, sport, political, and scientific) into Persian and the same Persian newspaper text types into English, at both sentence-level and whole-text translation, was examined based on the MQM-DQF model and compared by calculating the percentages of the two criteria, i.e. accuracy and fluency. As mentioned before, the object of the current
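The severity weighting and score formula above can be sketched in a few lines of Python. This is a hypothetical illustration of the calculation described in the text; the function name and the clamping at zero are the sketch's own choices, not part of the MQM-DQF specification:

```python
# Hypothetical sketch of the MQM-DQF scoring described above.
# Default severity weights quoted in the text: minor = 1, major = 10,
# critical = 100; null errors always score 0 and are omitted here.
SEVERITY_WEIGHTS = {"minor": 1, "major": 10, "critical": 100}

def mqm_dqf_score(error_counts: dict, word_count: int) -> float:
    """Return the quality score of one evaluated text.

    error_counts maps a severity level ("minor", "major", "critical")
    to the number of errors found at that level.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(SEVERITY_WEIGHTS[level] * count
                  for level, count in error_counts.items())
    # Clamp at zero so the score stays in [0, 1] as the text describes.
    return max(0.0, 1 - penalty / word_count)

# Example: 3 minor, 2 major, 1 critical error in a 500-word text:
# penalty = 3*1 + 2*10 + 1*100 = 123, score = 1 - 123/500 = 0.754
print(round(mqm_dqf_score({"minor": 3, "major": 2, "critical": 1}, 500), 3))
```

Note that with enough weighted errors the raw formula can fall below zero; the sketch clamps the result so it stays in the zero-to-one range the text assigns to the final score.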
study was not to compare the translation quality of Google Translate across different types of newspaper texts (i.e., scientific, cultural, political, economic, sports, etc.); such a comparison was postponed to future research. The results are as follows:

Research Findings
Economic texts: In English to Persian translation, at both levels, the maximum number of accuracy errors belonged to the mistranslation subcategory, and for the fluency criterion, the maximum number of errors belonged to the unintelligibility subcategory.
In Persian to English translation, at both levels, the maximum number of accuracy errors was related to the terminology subcategory, and for the fluency criterion, the maximum number of errors was related to the grammar subcategory.
According to the percentage calculated for each criterion in the translation of the two selected newspaper texts (Table 1 and Table 2), the percentages of the accuracy and fluency criteria in the whole-text translation of each text are lower than in sentence-level translation, and the percentages of both criteria in Persian into English translation are higher than the corresponding percentages in English–Persian translation.

Cultural texts: In English to Persian translation, at both levels, the maximum number of accuracy errors belonged to terminology and the maximum number of fluency errors belonged to the unintelligibility subcategory.
In Persian to English translation, the maximum number of accuracy errors at the sentence level was related to terminology, while in whole-text translation it was related to the mistranslation subcategory. The maximum number of fluency errors in whole-text translation was related to the unintelligibility subcategory, and at the sentence level the number of errors in the two subcategories of unintelligibility and grammar was the same.
Based on the percentages calculated for each criterion in the two translated texts, the percentages of the accuracy and fluency criteria in the whole-text translation of each text are lower than in sentence-level translation, and the percentage of both criteria in Persian–English translation is higher than in English–Persian translation.

Sport texts: In English to Persian translation, all accuracy errors were related to the mistranslation subcategory at both levels, and most fluency errors were related to the unintelligibility subcategory.
In Persian into English translation, most accuracy errors were related to the terminology subcategory, and most fluency errors were related to unintelligibility.
Based on the percentages calculated for each criterion in the two translated texts, the percentages of the accuracy and fluency criteria in the whole-text translation of each text were less than or equal to those of sentence-level translation. At both translation levels, accuracy in Persian to English translation was lower than in English to Persian translation, but fluency in Persian to English translation was higher than in English to Persian translation.

Political texts: In English to Persian translation, at both levels, the highest number of accuracy errors belonged to the mistranslation subcategory, and for the fluency criterion the most errors belonged to the typography subcategory.
In Persian to English translation, at both levels, the highest number of accuracy errors was related to the mistranslation subcategory, and for fluency the highest number of errors was related to unintelligibility.
Based on the percentages calculated for each criterion in the two selected texts, accuracy and fluency in the whole-text translation of each text were lower than in sentence-level translation. At both translation levels, accuracy in Persian–English translation was lower, but fluency was higher, than in English–Persian translation.

Scientific texts: In English to Persian translation, the number, type, and severity of accuracy and fluency errors were similar at both levels. The highest number of accuracy errors was related to mistranslation, and the highest number of fluency errors was related to unintelligibility.
In Persian to English translation, the number and type of accuracy and fluency errors were likewise similar at both levels. Most accuracy errors were related to mistranslation; half of the fluency errors were related to unintelligibility and the other half to grammar.
According to Table 1 and Table 2, the accuracy and fluency percentages in English to Persian translation were equal at both levels. In Persian to English translation, the accuracy percentage in sentence-level translation was higher than in whole-text translation, and the fluency percentage was the same at both levels. At both levels of translation, the percentage of both criteria was higher in Persian–English translation than in English–Persian translation.
Table 1
Percentages of quality criteria for newspaper texts translated from English into Persian (at the level
of sentence and whole text)
Table 2
Percentages of quality criteria for newspaper texts translated from Persian into English (at the level
of sentence and whole text)
Texts        Economic        Cultural        Sport           Political       Scientific
             sent.   whole   sent.   whole   sent.   whole   sent.   whole   sent.   whole
Fluency      99.8    95.2    98.2    94.6    94.8    93      99.8    94.9    98.2    98.2
Frequent errors in translating these texts by Google Translate are summarized below:

• Google Translate could not understand the implicit meaning of many satires, proverbs, metaphors, allusions, idioms, slang terms, cultural elements, and similar items, and made mistakes in translating them. For example, the phrase "to tick every box" is an idiom meaning "to satisfy all of the apparent requirements for success" (Collins dictionary, 2021), but this concept is not transmitted to the Persian-speaking reader, because Google Translate has rendered it into Persian as برای علامت زدن هر جعبه (baray e alamat zadan e har ja'be), which is a word-by-word translation.

• In the UK, the Prime Minister's building number is 10, and across the UK "No. 10" implicitly means the government; however, Google Translate transmitted the phrase "No. 10" into the translation as it is, without any additional cultural explanation.

• For some children's activities with a cultural aspect, Google Translate does not have equivalent words and phrases that can be understood in the target culture, including "rainbow trails" and "doorstep discos", or "electric milk float", which means a milk-distribution vehicle but has been rendered into Persian as شناور شیر الکتریکی (shenavar e shir e electrici). These are considered major errors because Google Translate has translated them word by word into Persian with no additional explanation.

• Google Translate may make mistakes in translating words with several common meanings (polysemy) and may pick the wrong sense. For example, the word "pilot", which has several meanings, should have been translated as پارکینگ مسطح (pilot parking) but was translated as خلبان (aviator).

• Google Translate may leave words that are unfamiliar in the target culture untranslated or translate them through the most similar word structure it knows. For example, "parklet", which means a small park, has been translated as پارک جیبی (pocket park), by analogy with the structure of the word "booklet".

• There were also some untransliterated words in these translations, most of them proper nouns; these are minor fluency errors. Google Translate has transmitted English proper nouns exactly in the Latin rather than the Persian alphabet, such as CNBC and Covid, which are understandable for the reader according to context.

• Google Translate made mistakes in translating proper nouns from Persian into English, translating the nouns instead of just transliterating them, such as rendering Fekri (a person's name) as "intellectual", or Esteghlal and Nassaji (club names) as "independence" and "textile".

• Google Translate made mistakes in converting and transferring units of weight, distance, time, volume, etc. For example, the solar year 1398 was transferred without conversion, so the reader realizes this error, and it is a major accuracy error. In another text, however, Google Translate mistranslated the same solar year as 2009, which was considered a critical accuracy error.

• Google Translate could not recognize words containing a ZWNJ in Persian texts: it sometimes joins the parts of such words and translates the resulting new words. Sometimes this joining creates an irrelevant word, as in the phrase کم‌کاری برخی از بازیکنان استقلال (the negligence of some Esteghlal players), where the word کم‌کاری (negligence) contains a ZWNJ, was changed to کمکاری (helping), and the phrase was then translated as "some Esteghlal players were helping".
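The ZWNJ behaviour described in the last bullet is easy to reproduce: the half-space (Unicode U+200C) is part of the Persian word, so dropping it yields a different string, which is then read as a different word. A minimal Python sketch using the example above:

```python
# U+200C is the zero-width non-joiner (ZWNJ), the Persian "half-space".
# kam-kari ("negligence") is written as two stems joined by a ZWNJ.
ZWNJ = "\u200c"
negligence = "کم" + ZWNJ + "کاری"     # کم‌کاری, "negligence"
fused = negligence.replace(ZWNJ, "")  # کمکاری, misread as "helping"

print(negligence == fused)            # False: they are different strings
print(len(negligence), len(fused))    # 7 6: they differ by the one ZWNJ code point
```

A whitespace-based word count treats کم‌کاری as one token either way; the damage is lexical, since the fused form is a different Persian word, which is why the study re-edited the Persian output before counting words.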
Table 3
Comparison of percentage averages (se = sentence-level translation, wh = whole-text translation, A = accuracy, F = fluency)

                     A, sentence level   A, whole text      F, sentence level   F, whole text
English to Persian   x̅A.se = 84.28       x̅A.wh = 82.04      x̅F.se = 92.52       x̅F.wh = 91.46
Persian to English   x̅A.se = 91.08       x̅A.wh = 85.26      x̅F.se = 98.16       x̅F.wh = 95.18
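As a quick arithmetic check (not part of the original study's tooling), the Persian-to-English fluency averages in Table 3 can be reproduced from the per-text fluency percentages reported in Table 2, reading each pair of figures as (sentence-level, whole-text):

```python
# Fluency percentages for Persian-to-English translation (Table 2),
# as (sentence-level, whole-text) pairs for the five text types.
fluency = {
    "economic":   (99.8, 95.2),
    "cultural":   (98.2, 94.6),
    "sport":      (94.8, 93.0),
    "political":  (99.8, 94.9),
    "scientific": (98.2, 98.2),
}

sentence_avg = sum(se for se, _ in fluency.values()) / len(fluency)
whole_avg = sum(wh for _, wh in fluency.values()) / len(fluency)

print(round(sentence_avg, 2))  # 98.16, matching x̅F.se in Table 3
print(round(whole_avg, 2))     # 95.18, matching x̅F.wh in Table 3
```

That the two averages land exactly on the x̅F.se and x̅F.wh values reported for Persian-to-English translation supports this reading of the table.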
The average of the accuracy percentages in translation from English into Persian is about 84% in sentence-level translation and about 82% in complete-text translation; the average fluency percentages are approximately 92% and 91% in sentence-level and complete translation, respectively (Table 3). The average accuracy percentages in translation from Persian into English are about 91% in sentence-level translation and about 85% in complete-text translation; the average fluency percentages are approximately 98% in sentence-level translation and about 95% in whole-text translation (see Table 3).

As Table 3 shows, all the averages of accuracy and fluency percentages in the texts translated from Persian into English were higher than the averages of the texts translated from English into Persian. According to Table 1 and Table 2, all percentages related to sentence-level translations are equal to or higher than those of whole-text translations.

DISCUSSION
Machine translation (MT) refers to the way computer software translates texts from one language into another without human intervention. The latest approach in machine translation, and the approach used in this research, is Neural Machine Translation (NMT). Google Neural Machine Translation (GNMT) is an NMT system that was developed by Google and introduced in November 2016. It uses an artificial neural network to increase fluency and accuracy in Google Translate. It was also stated that with Google's upgrade to NMT, translation of texts would be done at the sentence level (Turovsky, 2016).

In recent years, research has been conducted on the quality of online machine translation from English into Persian and vice versa, by either human or machine assessors, in various genres. However, it should be noted that the results of this research may no longer be reliable today due to the improvement of machine translation approaches. Here we discuss these studies and their results.

In 2011, an automated accuracy evaluation of Google Translate was done using 51 languages, based on BLEU. Due to the large number of languages in this study, human evaluation was not used. The results showed that "translations between European languages are usually good, while those involving Asian languages are often relatively poor" (Aiken & Balan, 2011, para. 1). These results were confirmed by Benjamin (2019): the corpus of stored texts in Google Translate is richer for upper languages, English above all, and poorer for bottom languages. On the other hand, in translating all pairs of non-English languages, Google acts indirectly; first it translates each language into English and then translates it into the other language. Therefore, the corpus of English texts in Google Translate is richer than those of other languages (Benjamin, 2019). However, according to a 2016 study on Google Translate performance, the difference between the frequencies of different types of English–Persian and Persian–English errors did not reach statistical significance, the conclusion being that the direction of translation does not affect the quality of Google Translate's output (Ghasemi & Hashemian, 2016, pp. 15–16). The present study found otherwise: Google's errors in translating texts from English into Persian outnumber those in translating from Persian into English.

In 2019, a study was conducted on the fifty languages used in Aiken and Balan's (2011) study, using the same sample texts, to measure the improvement of Google Translate in terms of accuracy. In 2011, Google Translate used the PBMT (Phrase-Based Machine Translation) approach, which was upgraded to NMT eight years later. In the 2011 study, the BLEU score for translation from English into Persian was 16. The 2019 study reported a BLEU score of 39 for translation from English into Persian and 66 for translation from Persian into English (Aiken, 2019). As was found in the current research, the quality of Google translation from Persian into English is better than that from English into Persian.

According to a study conducted in 2018 to evaluate the translation quality of three machine translation systems (including Google Translate) from Persian into English, out of six common error categories, the most common were missing words, extra words, and word order (Sharifiyan, 2018).

In the present study, in both English into Persian and Persian into English translations,
the highest number of accuracy errors was related to mistranslation and the highest number of fluency errors was related to unintelligibility. Among the subcategories of the grammar group, including morphology, part of speech, agreement, word order, and function words, the highest number of errors in translation from English into Persian and vice versa was related to part of speech.

As already mentioned, due to the improvement of machine translation approaches, the results of previous research on machine translation quality have been surpassed, and this trend continues; because of this progress in machine translation quality, translation quality assessment models need to be improved as well.

CONCLUSION
We used the MQM-DQF model, one of the most up-to-date TQA models, with its two criteria (i.e. accuracy and fluency) and their subsets, to assess the quality of Google Translate's performance in translating newspaper texts (Persian and English).

It was found that the percentages of the accuracy criterion in the five English newspaper texts translated into Persian were between 72.2% and 92.2%, and the percentages of the fluency criterion in the translated texts were between 89.5% and 94.5% (Table 1). The percentages of the accuracy criterion in the five Persian newspaper texts translated into English were between 76.4% and 96.6%, and the percentages of the fluency criterion in the translated texts were between 93% and 99.8% (Table 2). The accuracy and fluency percentages in sentence-level translation were equal to or higher than those in whole-text translation. All the averages of accuracy and fluency in the newspaper texts translated from Persian into English were higher than those of the texts translated from English into Persian. Therefore, it can be concluded that Google Translate performs better in translating Persian newspaper texts into English than in translating English newspaper texts into Persian. Moreover, the averages of accuracy and fluency in sentence-level translation are higher than in whole-text translation; therefore, it can be said that Google Translate performs better in translating newspaper texts at the sentence level than at the whole-text level.

REFERENCES
Aabedi, F. (2017). Quality estimation and generated content of machine translation programs: Babylon 10 Premium Pro review and Google Translator (Master's thesis, Faculty of Literature and Human Science, Imam Reza International University, Mashhad, Iran). Retrieved May 8, 2021, from https://ganj.irandoc.ac.ir/#/articles/8fb842d6931a485d5874ea36c95e49cb
Aiken, M., & Balan, S. (2011). An analysis of Google Translate accuracy. Translation Journal, 16(2). Retrieved July 12, 2021, from https://translationjournal.net/journal/56google.htm
Aiken, M. (2019). An updated evaluation of Google Translate accuracy. Studies in Linguistics and Literature, 3(3), 253. https://doi.org/10.22158/sll.v3n3p253
Andishe Borujeni, F. (2020). The quality of translation of cultural elements translated by human and machine in Natour Dasht novel based on Waddington model (Master's thesis, Faculty of Foreign Languages, Sheikh Baha'i University). Retrieved April 25, 2021, from https://ganj.irandoc.ac.ir/#/articles/24d46684b88c4d7010f7c2de74fb27ac
Benjamin, M. (2019). When & how to use Google Translate. Retrieved March 27, 2021, from https://www.teachyoubackwards.com/how-to-use-google-translate/
Collins dictionary online. (2021). Retrieved June 9, 2021, from https://www.collinsdictionary.com/dictionary/english/tick-all-the-boxes
Doherty, S. (2017). Issues in human and automatic translation quality assessment. In D. Kenny (Ed.), Human issues in translation technology (pp. 131–148). London, UK: Routledge.
Dorr, B., Olive, J., Mccary, J., & Christianson, C. (2011). Machine translation