
Re-evaluating the Role of Bleu in Machine Translation Research


Chris Callison-Burch,
Miles Osborne and Philipp Koehn

April 7, 2006


Talk Overview
• How do we currently evaluate MT research?

• What assumptions does our methodology rely on?

• Are those assumptions valid?

• If not, what does that imply for the field?


Conducting Research in MT
• Posit theory of how to improve translation quality

• Change the behavior of a translation system accordingly

• Translate a set of test sentences

• Compare translations before and after change

• If better, then write a paper


Determining Goodness
• To determine whether translation has improved, we need to measure translation quality

• Can be done manually by judging a translation’s fluency and adequacy


Fluency                 Adequacy
5. Flawless English     5. All
4. Good English         4. Most
3. Broken English       3. Much
2. Disfluent            2. Some
1. Incomprehensible     1. None


Human v. Automatic Evaluation


• Human evaluation is accurate, but
– It’s time consuming
– It’s expensive
– It’s not easy to re-use

• We would like an automatic metric


– Which can be run quickly at no cost
– Which correlates with human judgments

• Accomplished by comparing to references


Difficulties of Automatic Evaluation of MT


• Different from the Word Error Rate (WER) metric used in speech recognition
– WER assumes a single authoritative reference
– WER assumes linear ordering

• By contrast, translation has a range of possible realizations


– A variety of equally valid wordings
– Some phrases can be moved


Enter: Bleu
• “Bi-Lingual Evaluation Understudy”

• Allows multiple reference translations as an attempt to model the variety of possible translations

• Matches n-grams from the references without putting explicit constraints on order

• Has been shown to correlate with human judgments of translation quality



Bleu Detailed
References:
Rodriguez seemed quite calm as he was being led to the American plane that would take him to Miami in Florida .
Rodriguez appeared calm as he was being led to the American plane that was to carry him to Miami in Florida .
Rodriguez appeared calm as he was led to the American plane which will take him to Miami , Florida .
Rodriguez appeared calm while being escorted to the plane that would take him to Miami , Florida .

Hypothesis:
Appeared calm when he was taken to the American plane , which will to Miami , Florida .

Matches: 1-grams: 15, 2-grams: 10, 3-grams: 7, 4-grams: 3

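To make the matching step concrete, here is a minimal sketch of clipped n-gram counting against multiple references, in the style of Papineni et al. (2002). The function names are ours, and the exact counts obtained depend on tokenization:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(hypothesis, references, n):
    """Count hypothesis n-grams that also occur in a reference.

    Each n-gram's count is clipped to the maximum number of times it
    appears in any single reference, so repeating a matched word does
    not inflate the score.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    return sum(min(count, max_ref_counts[gram])
               for gram, count in hyp_counts.items())
```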

Bleu Detailed
• Calculates n-gram precision p_n for n = 1, 2, …, N (typically 4) by summing n-gram matches over every hypothesis translation in the test set

• Uses a brevity penalty to compensate for the lack of a recall term by penalizing translations that are too short (h = hypothesis length, r = reference length):

BP = \begin{cases} 1 & \text{if } h > r \\ e^{1 - r/h} & \text{if } h \le r \end{cases}

• Bleu is defined as the weighted geometric average of the p_n, scaled by BP:

Bleu = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
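A minimal sketch of how the pieces combine, at the segment level only; real Bleu sums match and length statistics over the whole test set before computing the precisions, and every p_n must be nonzero for the logarithm. Function and variable names are ours:

```python
import math

def bleu_score(precisions, hyp_len, ref_len):
    """Combine n-gram precisions [p1, ..., pN] with the brevity penalty.

    Uses uniform weights w_n = 1/N and assumes every precision is
    nonzero (with a zero precision the geometric mean is zero).
    """
    n = len(precisions)
    # Brevity penalty: no penalty if the hypothesis is longer than the
    # reference, exponential penalty otherwise.
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)
```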


Common Assumptions About Bleu


• Bleu is commonly reported as the sole evidence of improved translation quality in conference papers

• Sometimes failure to improve Bleu is taken as failure to improve translation quality (see “Word Sense Disambiguation v. SMT”)

• This relies on two key assumptions:

– That Bleu accurately accounts for allowable variation in translation
– That Bleu correlates with human judgments


Are These Assumptions Valid?


• Does an improvement in Bleu score guarantee a genuine translation
improvement?

• Does a failure to improve Bleu always mean that translation quality has not
improved?


Not Always
• We show that in some cases a higher Bleu score is neither sufficient nor
necessary to ensure genuine translation improvement.

• We do this in two ways


1. By showing that Bleu has a poor model of allowable variation and fails to distinguish between translations of differing quality
2. By showing two significant counterexamples to Bleu’s correlation with human
judgments


Equally Scoring Translations: Permutations


• Because Bleu does not constrain the order of n-grams, we can construct equally scoring translations by permuting the hypothesis around bigram mismatch points (sketched in code below)

Appeared calm | when | he was | taken | to the American plane | , | which will | to Miami , Florida .

• So this and 40,320 other candidates receive the same score:

which will | he was | , | when | taken | Appeared calm | to the American plane | to Miami , Florida .

• Current systems produce translations with millions of similarly scoring permutations, up to 10^73. Are all of these likely to be judged equally valid?
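A minimal sketch of the construction, assuming whitespace-tokenized input (function names are ours): split the hypothesis wherever a bigram is matched by no reference, then reorder the resulting blocks. Unigram counts are unchanged, and any higher-order n-gram crossing a block boundary was already a mismatch, so, barring accidental new matches at the new boundaries, every reordering receives the same Bleu score.

```python
import itertools

def bigram_mismatch_blocks(hypothesis, references):
    """Split hypothesis tokens at bigrams unmatched by any reference."""
    ref_bigrams = {tuple(ref[i:i + 2])
                   for ref in references for i in range(len(ref) - 1)}
    blocks, current = [], [hypothesis[0]]
    for prev, tok in zip(hypothesis, hypothesis[1:]):
        if (prev, tok) in ref_bigrams:
            current.append(tok)     # bigram matched: extend current block
        else:
            blocks.append(current)  # mismatch: start a new block
            current = [tok]
    blocks.append(current)
    return blocks

# With k blocks there are k! orderings; the 8 blocks in the example
# above give 8! = 40,320 candidates, all scored identically by Bleu:
# for perm in itertools.permutations(blocks):
#     candidate = [tok for block in perm for tok in block]
```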


Equally Scoring Translations: Substitutions


• Different items may be drawn from references and receive the same score

was being led to the | calm as he was | would take | carry him | seemed
quite | when | taken

• Unmatched words (when, taken) can be replaced by anything (black, helicopters), as sketched below

• Bleu’s model of allowable variation in translation is insufficient to distinguish between translations of differing quality

• Bleu cannot be guaranteed to correlate with human judgments
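A tiny illustration of the substitution point, reusing the hypothesis from the earlier example and assuming, as that slide states, that “when” and “taken” match no reference n-gram:

```python
# Replacing the unmatched words leaves every n-gram match count and the
# hypothesis length unchanged, so the Bleu score cannot change either.
hyp = ("Appeared calm when he was taken to the American plane , "
       "which will to Miami , Florida .").split()
swap = {"when": "black", "taken": "helicopters"}  # unmatched -> arbitrary
variant = [swap.get(tok, tok) for tok in hyp]
# `hyp` and `variant` receive identical Bleu scores against the
# references above, despite `variant` being nonsense.
```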


Failures in Practice
• Criticism: these are constructed examples; Bleu assumes a cooperative environment

• These failures happen in practice too


– In the 2005 NIST MT Eval, the system ranked 6th by Bleu scored 1st in the manual human evaluation
– Bleu incorrectly ranks a poor phrase-based MT system higher than a good rule-based system


NIST 2005 Results


[Scatter plot: Human Score (adequacy, 2–4) vs. Bleu Score (0.38–0.52) for the 2005 NIST MT Eval systems, with a fitted correlation line.]



NIST 2005 Results


[Scatter plot: Human Score (fluency, 2–4) vs. Bleu Score (0.38–0.52) for the 2005 NIST MT Eval systems, with a fitted correlation line.]


Example
Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs.

Hypothesis 1: Iran has already stated that Kharazi’s statements to the conference because of the Jordanian King Abdullah II in which he stood accused Iran of interfering in Iraqi affairs.
n-gram matches: 27 unigrams, 20 bigrams, 15 trigrams, 10 4-grams
Human scores: Adequacy: 3, 2; Fluency: 3, 2

Hypothesis 2: Iran already announced that Kharrazi will not attend the conference because of the statements made by the Jordanian Monarch Abdullah II who has accused Iran of interfering in Iraqi affairs.
n-gram matches: 24 unigrams, 19 bigrams, 15 trigrams, 12 4-grams
Human scores: Adequacy: 5, 4; Fluency: 5, 4

Despite similar n-gram match counts, and hence similar Bleu scores, the two hypotheses receive very different human judgments.


Systran v. SMT
[Scatter plot: Human Score (adequacy and fluency, 2–4.5) vs. Bleu Score (0.18–0.30) for Systran, SMT System 1 (full training set), and SMT System 2 (small training set). Humans judge Systran comparable to or better than SMT System 1, despite its much lower Bleu score.]


Implications for Research


• Higher Bleu score does not guarantee genuine improvement in translation
quality

• It is therefore inappropriate and insufficient to:


– Run workshops to compare systems using Bleu alone
– Compare systems which employ heterogeneous strategies using Bleu
– Report translation improvements in conference papers without examples and
manual verification
– Dismiss research which fails to improve Bleu as not improving translation
quality


Conclusions
• We have shown:
– Increasing Bleu is not sufficient to guarantee a genuine improvement
– Increasing Bleu is not necessary for a genuine improvement

• This breaks the fundamental assumption that Bleu correlates with human judgments

• It implies that the current methodology for evaluating MT research is flawed

• We must develop a new evaluation methodology


Thank you!


What Should We Do Instead?


• Human evaluation

• Careful experimental design with clear, testable hypothesis

• Focused manual evaluation to test whether the hypothesis holds

• Show examples in papers

• Publish all translations online


When Can We Use Bleu?


• To compare different versions during system development

• As an objective function for minimum error rate training

• As a “sanity check” prior to doing human evaluation


Other Known Deficiencies of Bleu


• Scores are hard to interpret

• Different numbers of references lead to radically different scores

• Does not work at the per-sentence level

• Content-bearing words are given no more weight than function words
