
The Remaining Items from Seminar 1

1. Machine Translation and Post-editing as Part of the End-to-End Localization Process.


The Impact of MT Developments on the Localization Industry.
The Right Solution for the Right Content.
Content Evaluation.
2. MT Output Evaluation and Testing Methodologies.
Automated MT Testing.
Human MT Testing.
Appendix 1: Detailed Information on Automated MT Evaluations.
3. Integration of MT into the Workflow. The Place of MT in the Translation Production Process.

Appendix 1: Detailed Information on Automated MT Evaluations

This appendix contains more detailed information on automated MT evaluation methods.

BLEU (BiLingual Evaluation Understudy) score: This algorithm aims to evaluate the quality of text that has been machine translated. The central idea behind BLEU is that "the closer a machine translation is to a professional human translation, the better it is." To assess this, scores are calculated for individual translated segments (generally sentences) by comparing them with a set of good-quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Even though BLEU has become a standard in the industry, it has its limitations. Intelligibility and grammatical correctness, for instance, are not taken into account explicitly, as they are assumed to be reflected in the reference translations.
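As a concrete illustration, here is a minimal sketch of a segment-level and corpus-level BLEU calculation using the NLTK library; the example sentences and the naive whitespace tokenization are assumptions made for the sketch, not part of the metric itself:

```python
# A minimal BLEU sketch using NLTK (assumes nltk is installed).
# Real pipelines use proper tokenizers instead of whitespace splitting.
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

reference = "the cat is on the mat".split()      # good-quality human reference
hypothesis = "the cat sat on the mat".split()    # raw MT output

# Smoothing avoids zero scores on short segments with no 4-gram matches
smoothie = SmoothingFunction().method1

# Segment-level score (BLEU is designed to be averaged over a corpus)
segment_score = sentence_bleu([reference], hypothesis, smoothing_function=smoothie)

# Corpus-level score: one list of references per hypothesis
corpus_score = corpus_bleu([[reference]], [hypothesis], smoothing_function=smoothie)

print(f"Segment BLEU: {segment_score:.3f}")
print(f"Corpus BLEU:  {corpus_score:.3f}")
```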

NIST: The name of this metric comes from the US National Institute of Standards and Technology. This measure is based on the BLEU score, but it differs from the BLEU algorithm in several ways.

While BLEU simply counts how many n-grams match between the reference translation and the MT output and gives all matching n-grams the same weight, NIST also calculates how "informative" a particular n-gram is. When a correct n-gram is found, the algorithm measures whether that combination is a common sequence in the corpus material or a comparatively rare one. Depending on the result, the n-gram is given more or less weight. For example, a correct match of the bigram "on the" receives a lower weight than a correct match of the bigram "interesting calculations," as the latter is much less likely to occur.

NIST also differs from BLEU in terms of how some penalties are calculated. For example, small variations in translation length do not impact the overall NIST score as much as they do in BLEU.
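For comparison, a similar sketch using NLTK's NIST implementation might look like the following; the sentences and tokenization are again illustrative assumptions:

```python
# A small NIST sketch using NLTK's nist_score module.
from nltk.translate.nist_score import sentence_nist

reference = "the results of these interesting calculations are shown below".split()
hypothesis = "the results of these interesting calculations appear below".split()

# n=5 is the conventional maximum n-gram length for NIST;
# rarer matched n-grams contribute more information weight to the score.
print(f"Sentence NIST: {sentence_nist([reference], hypothesis, n=5):.3f}")
```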
METEOR (Metric for Evaluation of Translation with Explicit ORdering): This metric was designed to address some of the problems found in the more popular BLEU metric, and it also produces a good correlation with human judgment at the sentence or segment level (unlike BLEU, which seeks correlation at the corpus level). METEOR introduced several features that had not been part of any other metric at the time. Matches in METEOR are made according to the following parameters, among others:

• Exact words: As with other metrics, a match is made if two words are identical in the machine translation output and the reference translation.

• Stem: Words are reduced to their stem form. If two words have the same stem, a match is also made.

• Synonymy: Words are matched if they are synonyms of one another. Words are considered synonymous if they share any synonym sets according to an external database.
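A minimal sketch of METEOR scoring with NLTK is shown below. It assumes a recent NLTK version, which expects pre-tokenized input, and it requires the WordNet data for synonym matching; the example sentences are invented for illustration:

```python
# A METEOR sketch using NLTK; WordNet serves as the external synonym database.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed once for synonym matching

reference = "the weather was extremely cold last night".split()
hypothesis = "the weather was very cold last night".split()

# Recent NLTK versions expect a list of tokenized references and a tokenized hypothesis.
score = meteor_score([reference], hypothesis)
print(f"METEOR: {score:.3f}")
```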

Levenshtein distance: This metric measures the similarity or dissimilarity ("distance") between two text strings by calculating the minimum number of single-character edits (insertions, deletions and substitutions) required to change one string into the other. In the field of machine translation, this can be done by comparing the raw MT output to the human translation.

Let's look at a couple of examples:

The Levenshtein distance between "sport" and "short" is 1, because one edit is required to convert one word into the other (replace "p" with "h"). The Levenshtein distance between "dog" and "frog" is 2, as it is not possible to convert the first word into the second with fewer edits (replace "d" with "f" and add "r").

The Levenshtein distance is bounded by the length of the longer input string: even if the two strings have nothing in common, the number of edits will never exceed the number of characters in the longer string. Example: for "computer" and "alibi", the Levenshtein distance is 8 and can be no higher than 8:

Replace "c" with "a" Replace "o" with "I"


Replace "o" with "I"

Replace "m" with "I"


Replace "p" with "b"

Replace "u" with "I"

Delete "t"

Delete "e"

Delete "r"

As with other automated measures, the results of the Levenshtein distance are not set in stone.
As mentioned before, there can be many correct translations for a single source. The
Levenshtein distance will not be able to measure quality on its own. Results will vary, for
example, if clauses are positioned differently in the MT output and in the human reference
translation.

Example:

MT output: "If I go home after 10 pm, I will let you know"
Reference human translation: "I will let you know if I go home after 10 pm"

In this case, the MT output is correct and no changes would be necessary during a post-editing stage. However, the Levenshtein distance will be quite high, as many changes would be required to turn the first sentence into the second one.

TER (Translation Edit Rate): This is a word-based metric that calculates the minimum number of edits required to match an MT output to a correct reference translation, normalized by the length of the reference:

TER = (# of edits) / (average # of reference words)
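As a rough illustration of this formula, the sketch below computes a simplified TER by dividing the word-level edit distance by the number of reference words. Note that full TER also counts block shifts (moving a contiguous phrase) as a single edit, which this simplification omits, and averages the reference length when multiple references are available:

```python
# Simplified TER: word-level edit distance divided by reference length.
# Lowercasing and whitespace tokenization are naive choices for the sketch.
def simple_ter(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    # word-level Levenshtein distance
    prev = list(range(len(ref) + 1))
    for i, hw in enumerate(hyp, start=1):
        curr = [i]
        for j, rw in enumerate(ref, start=1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)

mt = "If I go home after 10 pm I will let you know"
ref = "I will let you know if I go home after 10 pm"
# The reordered clauses inflate the score even though the MT output is usable.
print(f"Simplified TER: {simple_ter(mt, ref):.2f}")
```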

TERP (TER-Plus) is an extension of Translation Edit Rate (TER). It builds on the success of TER as an evaluation metric and alignment tool while addressing several of TER's weaknesses through the use of paraphrases, morphological stemming and synonyms, as well as edit costs that are optimized to correlate more closely with various types of human judgments. Put simply, TERP measures the number of edits that are necessary to go from the raw MT output to a final edited version. As such, it is a helpful metric for measuring typing and editing effort. The TERP score is usually expressed as a percentage from 0 to 100: the higher the number, the more editing was required.
