
MT Decider Benchmark — Language Pair Report

© 2022, Polyglot Technology LLC

Evaluation Quarter: 2022Q2


Translation Direction: French → English

Introduction
Online machine translation (MT) services offer affordable, high-quality translations. Translation providers, web
publishers and other machine translation users want to use the best MT for the languages they need to
translate. Evaluating these black-box, ever-changing MT services is difficult: they are often chosen by
word of mouth, vendor messaging, availability of tool integrations or ad-hoc translation of a handful of
sentences.

What is missing is a vendor-independent, transparent, and up-to-date evaluation of MT services. The MT
Decider Benchmark, created by Polyglot Technology, is the solution. The benchmark provides reliable, detailed
and actionable data for decision makers. The MT Decider Benchmark, published quarterly, consists of a summary
report with evaluation results and analysis across 24 language pairs and 4 online machine translation services,
and individual reports for each translation direction in each language pair (i.e. 48 reports).

Language pair reports, such as the one you are reading, contain detailed evaluation results for one translation
direction in the language pair, along with a segment analysis showing the textual differences between the
translations of the best-performing MT service and the second-best-performing engine.

Machine Translation Evaluation


Human evaluation is the best method for determining whether machine translation is of good quality.
We can ask speakers of the source language and the target language, or better yet professional translators, to
judge whether machine translations are adequate and fluent translations of the original text. We can ask
evaluators how close the machine translation is to a human reference translation, or to annotate errors in the
translations. Human evaluation, however, is slow and hard to scale across many language pairs and domains.
Repeating human evaluation periodically in a consistent way is also challenging.

Automatic metrics that use human reference translations make it possible to calculate numeric scores for machine
translation quality.

For 20 years now the predominant automatic metric has been BLEU, measuring the similarity of machine
translations to human reference translations on a scale from 0 to 1 (or 0 to 100 when expressed as percentages).
More details on BLEU and how to interpret it can be found in the section Interpreting BLEU Scores below.

To calculate BLEU for this report we use the open source package sacreBLEU. We would like to thank Matt Post
and all contributors to sacreBLEU for making this invaluable tool available.
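
For readers who want to reproduce a corpus-level score, here is a minimal sketch using sacreBLEU's Python API; the file names are illustrative placeholders and not part of the benchmark.

```python
# Minimal sketch: corpus-level BLEU with the sacreBLEU Python API.
# The file names are placeholders; any plain-text files with one
# detokenized segment per line, in the same order, will work.
import sacrebleu

with open("mt_output.en", encoding="utf-8") as f:      # machine translations
    hypotheses = [line.strip() for line in f]

with open("reference.en", encoding="utf-8") as f:      # human reference translations
    references = [line.strip() for line in f]

# corpus_bleu takes the hypotheses and a list of reference streams
# (one inner list per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")                      # reported on the 0-100 scale
```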

The second automatic metric that Polyglot Technology chose for the benchmark is COMET, a newer MT evaluation
method developed by Unbabel. Unlike BLEU, which matches phrases between the machine translation output and
the human translation in the target language, COMET also takes the source text into account and rates
translations on their semantic content rather than just surface words. In a study conducted in 2021, COMET
showed close correlation with human judgments of translation quality.

COMET scores are floating point numbers; the higher the score, the better. As with BLEU, comparing scores across
different language pairs or test sets is not meaningful. The relative difference in COMET scores between two
translations, however small, is an accurate reflection of human judgments of differences in translation
quality.

More details on COMET and how to interpret it can be found in the section Interpreting COMET Scores below.
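
The sketch below shows how such scores can be produced with the unbabel-comet 1.x Python API (the version range used for this report); the machine translation string is illustrative, and newer COMET releases return results in a slightly different form.

```python
# Minimal sketch: scoring a translation with COMET (unbabel-comet 1.x)
# using the same wmt20-comet-da model as this report.
from comet import download_model, load_from_checkpoint

model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Bon, ça ne va pas du tout.",      # source sentence
        "mt": "Well, that's not good at all.",    # machine translation (illustrative)
        "ref": "Well, that is no good at all.",   # human reference translation
    }
]

# In COMET 1.x, predict() returns per-segment scores and a system-level score.
segment_scores, system_score = model.predict(data, batch_size=8, gpus=0)
print(segment_scores, system_score)
```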

Description of Test Sets


High-quality test data, the human reference translations, are the most important factor in calculating meaningful
metrics. For the MT Decider Benchmark we are using high quality human reference translations from the
Conference on Machine Translation (WMT) and the International Conference on Spoken Language Translation
(IWSLT), the two premier academic conferences on machine translation.

Over the years the conferences have published test sets in various language pairs - we used the data available in
the sacreBLEU evaluation tool as of version v2.0.0:

WMT: news texts in 24 language pairs
IWSLT: translated TED talks in 25 language pairs

If test sets were available from multiple years, we chose the most recent test set to keep evaluations relevant.

We did not evaluate English↔Korean because the Korean tokenizer was not available in the sacreBLEU version
v2.1.0 that we used.
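
If you want to inspect the raw test data yourself, the sacrebleu command-line tool can print the source and reference sides of a test set. The sketch below assumes the sacrebleu CLI (v2.x) is installed and on the PATH; the helper function is purely illustrative.

```python
# Minimal sketch: pulling test set data via the sacrebleu CLI's --echo option.
import subprocess

def echo(test_set, langpair, side):
    """Return one side ("src" or "ref") of a sacreBLEU test set, one segment per item."""
    result = subprocess.run(
        ["sacrebleu", "-t", test_set, "-l", langpair, "--echo", side],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

french_sources = echo("wmt15", "fr-en", "src")
english_refs = echo("wmt15", "fr-en", "ref")
print(len(french_sources), len(english_refs))  # 1500 segments each for wmt15 fr-en
```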

Here are the segment counts for the translation direction:

Test set counts for French → English:

Test set   Segment count
iwslt17    1455
wmt15      1500

Evaluated Machine Translation Services


Polyglot Technology evaluated globally available online MT services that users can sign up for with a credit card,
focusing on providers in the Americas and Europe:

Amazon Translate
DeepL
Google Translate
Microsoft Translator

We are interested in including additional providers in the MT Decider Index in the future, provided that users can
sign up with a credit card and the services are available and accessible globally.

The machine translations for the benchmark were retrieved on 22nd and 23rd June 2022.

The evaluated services support all evaluated language pairs, with the exception of DeepL, which at the time of the
evaluation did not support the language pairs English↔Gujarati, English↔Hindi, English↔Tamil, and
English↔Kazakh.

Many Computer Aided Translation (CAT) tools, Translation Management Systems (TMS) and Content
Management Systems (CMS) support integrating all or some of the evaluated MT services. Please refer to your
CAT/TMS/CMS documentation or contact us at info@polyglot.technology.

Automatic Evaluation Scores


BLEU Score Results
Here are the BLEU score results (measured with sacreBLEU v2.1.0) for the translation direction:
When test sets are available from both WMT and IWSLT, the ranking of the MT services will more often than not be
the same or similar across the test sets, and choosing the best service will be straightforward.

Should the BLEU score results differ significantly between WMT and IWSLT, the choice of MT service depends on
the type of content you are translating: if the content is more formal and is something other than transcribed
speech, the WMT BLEU scores are more relevant; if the content is more informal and/or is transcribed speech,
the IWSLT BLEU scores are more relevant.

COMET Score Results


Here are the COMET score results (measured with COMET v1.1.2 using the default wmt20-comet-da model) for
the translation direction:

As described in the section Machine Translation Evaluation, the COMET score, unlike the BLEU score, takes the
source text into consideration and produces a score that relates more to the meaning of the text than to the
surface form of the words of the translation.

In some cases the MT service rankings differ between BLEU and COMET. If this happens, you have to decide
what is more important for your translation projects: if adherence to word surface forms (including terminology)
matters more, the BLEU ranking is more relevant; if translations can be more free-form and conveying the
meaning is more important, the COMET ranking is more relevant.

Segment Analysis
A single numerical score like BLEU or COMET, measured on a high-quality test set in a consistent fashion, is a
very good indicator of machine translation quality.

Seeing translation differences between MT services is also very informative. For this we employ segment
analysis; a short code sketch follows the steps below:

1. We pick the two MT services with the two highest COMET scores on the WMT test set - say "MT service 1"
and "MT service 2"
2. We calculate the COMET scores on all translations in the test sets (we use COMET, because it produces
reliable segment level scores, unlike BLEU, which should only be calculated on an entire test set)
3. We calculate the COMET score difference between "MT service 1" and "MT service 2"
4. We sort the translations by the COMET score difference, i.e. the difference in machine translation quality
between "MT service 1" and "MT service 2"
5. We pick 20 segments at random, keep them sorted by the score difference, and mark the edits needed to
transform the translation of "MT service 1" into that of "MT service 2"
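
The following is a minimal sketch of steps 2 through 5, assuming the per-segment COMET scores of the two top-ranked services are already available as parallel lists (for example from the COMET snippet shown earlier); all names are illustrative.

```python
# Minimal sketch: sort segments by the COMET score difference between the two
# best services and draw a random sample that stays sorted by that difference.
import random

def rank_segments(sources, references, mt1, mt2, scores1, scores2, sample_size=20):
    """Return a random sample of segments, ordered by the per-segment
    COMET score difference between "MT service 1" and "MT service 2"."""
    rows = [
        {"src": s, "ref": r, "mt1": h1, "mt2": h2,
         "delta": c1 - c2}              # positive: service 1 scored higher
        for s, r, h1, h2, c1, c2
        in zip(sources, references, mt1, mt2, scores1, scores2)
    ]
    rows.sort(key=lambda row: row["delta"], reverse=True)
    # Sample indices at random, then keep them in sorted (score-difference) order.
    picked = sorted(random.sample(range(len(rows)), k=min(sample_size, len(rows))))
    return [rows[i] for i in picked]
```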

The procedure produces the following table of translation differences between the best-scoring engine and the
second-best-scoring engine, ordered from the most improved translations to the least improved.

Source text: Bon, ça ne va pas du tout.
Target text: Well, that is no good at all.
Difference between deepl and google: Alright, iWell, that's not okaygood at all.

Source text: La « dette », c'est des d'actifs détenus essentiellement par des institutions financières dans ce pays.
Target text: The "debt" are assets held mainly by financial institutions in this country.
Difference between deepl and google: “DThe "debt”" is assets held primarimainly by financial institutions in thatis country.

Source text: Ca a l'air horrible.
Target text: That sounds terrible.
Difference between deepl and google: It looks awfulhorrible.

Source text: On dirait que Mary Whitehouse est revenue d'entre les morts.
Target text: It seems Mary Whitehouse has arisen from her grave
Difference between deepl and google: LIt looks like Mary Whitehouse cahas come back from the dead.

Source text: Ils sont une menace dans nos vies quotidiennes par la même et par notre argent qu'ils détiennent..
Target text: They are a threat in our daily lives due to the same and our money which they are in possession of...
Difference between deepl and google: They are a threat in our daily lives by the same and by our money that they hold..

Source text: La phrase de Farage a beaucoup surpris, mais seulement parce qu'on parle si rarement de ce problème.
Target text: Farage's comment came as quite a shock, but only because it is so rarely addressed.
Difference between deepl and google: Farage's sentence came as a great surprised many, but only because this problemissue is so rarely talked aboutdiscussed.

Source text: Bonsoir Je pense que le bonnet rouge qui a manifesté contre les impôts est heureux de savoir que ce voyage est sponsorisé par la Région Bretagne et la Ville de Brest.
Target text: Good evening. I think that the extremist who protested against taxes is happy to know that this trip is sponsored by the Brittany Region and the City of Brest.
Difference between deepl and google: Good evening I think that the red caphat who demonstrated against taxes is happy to know that this trip is sponsored by the Region of Brittany Region and the City of Brest.

Source text: J’ajouterais qu’une assurance chômage digne participe à la valorisation du travail salarié.
Target text: I would add that a dignified unemployment insurance contributes to the valuation of salaried work.
Difference between deepl and google: I would add that a decent unemployment insurance contributes to the improved status of employed work.

Source text: Melbourne n'est vraiment pas le meilleur endroit pour un observatoire, même si les choses s'améliorent grandement de l'autre côté de la Cordillère.
Target text: Melbourne definitely isn't the best site for an observatory, although things get much better on the other side of The Dividing Range.
Difference between deepl and google: Melbourne is really isn'not the best place for an observatory, althougheven if things are looking up a lotgetting much better on the other side of the Cordillera.

Source text: Sa solution est extrêmement discutable et ce serait bien si les gens débattait de ce qu'il dit réellement plutôt que de simplement l'étiqueter comme irresponsable et mettre de côté les préoccupations de lui et d'une grandes quantité de personnes laissées pour compte.
Target text: His solution is extremely debatable, and it would be nice if people would debate what he actually means rather than just label him irresponsible and sideline the concerns that he and a huge amount of disenfranchised people hold.
Difference between deepl and google: His solution is extremely questionable and it would be nice if people would debated what he actually saidys rather than just labeling him as irresponsible and pusetting aside the concerns of him and a large amountnumber of people being left behind.

Source text: Si la FOM a choisi de ne pas montrer ces images c’est uniquement pour protéger ses intérêts en aucun cas ceux du pilote.
Target text: If the FOM chose not to show these images, it is only to protect its interests, on no account the driver's.
Difference between deepl and google: If the FOM has chosen not to show these images it is only to protect its interests and in no waynot those of the pilot.

Source text: J'ai réussi à déguster 19 bières différentes, Wobbly Weasel étant une de mes préférées.
Target text: I managed to sample 19 different ales, with Wobbly Weasel being a personal favourite.
Difference between deepl and google: I managed to taste 19 different beers, Wobbly Weasel being one of my favorites.

Source text: Et s'il le défendait, il serait avalé par le rouleur compresseur occidental.
Target text: If he did, he would get eaten up by western justice system.
Difference between deepl and google: And if he defended it, he would be swallowed up by the Western steamroller.

Source text: Votre recit me fait rever, en m’identifiant un peu au passage (j’ai aussi appris a naviguer avec les Crocos de L’Elorn au Moulin Blanc en rade de Brest!).
Target text: Your story makes me dream, by identifying myself a little in passing (I also learnt to sail with the Crocodiles de L'Elorn in the White Mill in Brest harbor!).
Difference between deepl and google: Your story makes me dream, identifying myself a little in passing (I also learned to sail with the Crocos de L'Elorn at the Moulin Blanc in the harbor of Brest!).

Source text: Bajevic était milieu de terrain ou avant dans les années 70 et 80 et a entraîné de nombreuses équipes en Grèce, mais je ne sais pas d'où il est.
Target text: Bajevic was a midfielder or forward in the 70s and 80s, and coach of many teams in Greece, but I dunno where from.
Difference between deepl and google: Bajevic was a midfielder or beforwarde in the 70s and 80s and coached many teams in Greece, but I don' not know where he is from.

Source text: L’électricité nucléaire c’est zéro émission de CO2 à la production et epsilon pour les autres étapes.
Target text: Nuclear electricity means zero CO2 emissions during production and epsilon for the other stages.
Difference between deepl and google: Nuclear electricity means zero CO2 emissions inat the production stage and epsilon for the other stages.

Source text: Chez Luis, le seul vrai défaut qu'ils aient pu trouver c'était la base un peu sèche (même si je suis d'accord avec Hollywood pour dire que le glaçage vert sur le moulin de Richard ne donnait pas envie, de toute façon Richard avait déjà perdu à ce moment-là).
Target text: Luis' bottom layer being a bit dry was the only real fault they found (though I agreed with Hollywood that the lime green icing on Richard's windmill was a bit off putting, though Richard had already blown it by that point).
Difference between deepl and google: At Luis', the only real faultlaw they could find was the base being a bit drysomewhat dry base (although I agree with Hollywood that the green icing on Richard's grinder didn't soundmill was not appealing anyway, Richard had already lost by then anyway).

Source text: Il faut choisir.
Target text: A choice must be made.
Difference between deepl and google: It's necessaryYou have to choose.

Source text: Vous savez que vous n’êtes pas obligé d’être condescendant et de considérer votre interlocuteur comme un demeuré quand vous postez un commentaire?
Target text: Do you know that you are not obliged to be condescending and to consider your interlocutor as a backward person when you post a comment?
Difference between deepl and google: You know that you don't have to be condescending and consider your interlocutor as a retardtreat the other person like a moron when you post a comment?

Source text: Lettres, philosophie et art sont des filières bisounours ?
Target text: English, Philosophy and Art are Mickey Mouse degrees?
Difference between deepl and google: LettersAre literature, philosophy and art are Care Bear sectorsthe same thing as a carefree course of study?

Interpreting BLEU Scores


The paragraphs in this section are excerpted from Google AutoML Translate's documentation page on evaluation,
which is licensed under the Creative Commons Attribution 4.0 License.

BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The
BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a
set of high quality reference translations. A value of 0 means that the machine-translated output has no overlap
with the reference translation (low quality) while a value of 1 means there is perfect overlap with the reference
translations (high quality).
It has been shown that BLEU scores correlate well with human judgment of translation quality. Note that even
human translators do not achieve a perfect score of 1.0 (for the reason that a source sentence can have several
valid, equally appropriate translations).

Interpretation
Trying to compare BLEU scores across different corpora and languages is strongly discouraged. Even comparing
BLEU scores for the same corpus but with different numbers of reference translations can be highly misleading.

However, as a rough guideline, the following interpretation of BLEU scores (expressed as percentages rather
than decimals) might be helpful.

BLEU Score Interpretation

< 10 Almost useless

10 - 19 Hard to get the gist

20 - 29 The gist is clear, but has significant grammatical errors

30 - 40 Understandable to good translations

40 - 50 High quality translations

50 - 60 Very high quality, adequate, and fluent translations

> 60 Quality often better than human

The following color gradient can be used as a general scale interpretation of the BLEU score:

Interpreting COMET Scores


The paragraphs in this section are excerpted from Unbabel's FAQ page on COMET, which is licensed under the
Apache License, Version 2.0.

Since we released COMET we have received several questions related to interpretability of the scores and usage.
In this section we try to address these questions the best we can!

Is there a theoretical range of values for the COMET regressor?


Before we dig deeper into details about COMET scores I would like to clarify something:

Absolute scores via automatic metrics are meaningless (what does 31 BLEU mean without context? It can be an
awesome score for News EN-Finnish or a really bad score for EN-French), and pretrained metrics only amplify this by
using different scales for different languages and especially different domains.

Check Kocmi et al. 2021 and our discussion here: #18


Most COMET models are trained to regress on a specific quality assessment and in most cases we normalize
those quality scores to obtain a z-score. This means that theoretically our models are unbounded! The scores
themselves have no direct interpretation, but they correctly rank translations and systems according to their quality!

Also, depending on the data they were trained on, different models might have different score ranges.
We observed that most scores for our wmt20-comet-da model fall between -1.5 and 1, while our wmt21-comet-qe-da
model produces scores between -0.2 and 0.2.
