
Re-evaluating the Role of Bleu in Machine Translation Research


Chris Callison-Burch,
Miles Osborne and Philipp Koehn

April 7, 2006


Talk Overview
• How do we currently evaluate MT research?

• What assumptions does our methodology rely on?

• Are those assumptions valid?

• If not, what does that imply for the field?


Conducting Research in MT
• Posit theory of how to improve translation quality

• Change the behavior of a translation system accordingly

• Translate a set of test sentences

• Compare translations before and after change

• If better, then write a paper


Determining Goodness
• To determine whether translation has improved, we need to measure translation quality

• Can be done manually by judging a translation’s fluency and adequacy


Fluency                 Adequacy
5. Flawless English     5. All
4. Good English         4. Most
3. Broken English       3. Much
2. Disfluent            2. Some
1. Incomprehensible     1. None


Human v. Automatic Evaluation


• Human evaluation is accurate, but
– It’s time consuming
– It’s expensive
– It’s not easy to re-use

• We would like an automatic metric


– Which can be run quickly at no cost
– Which correlates with human judgments

• Accomplished by comparing to references


Difficulties of Automatic Evaluation of MT


• Different from the Word Error Rate (WER) metric used in speech recognition
– WER assumes a single authoritative reference
– WER assumes linear ordering

• By contrast, translation has a range of possible realizations


– A variety of equally valid wordings
– Some phrases can be moved


Enter: Bleu
• “Bi-Lingual Evaluation Understudy”

• Allows multiple reference translations as an attempt to model the variety of possible translations

• Matches n-grams from the references without putting explicit constraints on order

• Has been shown to correlate with human judgments of translation quality



Bleu Detailed
References:
Rodriguez seemed quite calm as he was being led to the American plane that would take him to Miami in Florida .
Rodriguez appeared calm as he was being led to the American plane that was to carry him to Miami in Florida .
Rodriguez appeared calm as he was led to the American plane which will take him to Miami , Florida .
Rodriguez appeared calm while being escorted to the plane that would take him to Miami , Florida .

Hypothesis:
Appeared calm when he was taken to the American plane , which will to Miami , Florida .

Matches: 1-grams: 15, 2-grams: 10, 3-grams: 7, 4-grams: 3

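To make the matching step concrete, here is a minimal sketch of clipped n-gram counting against multiple references, in the style of Papineni et al. (2002). The function names are ours, and the exact counts obtained depend on tokenization:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(hypothesis, references, n):
    """Count hypothesis n-grams that also occur in a reference.

    Each n-gram's count is clipped to the maximum number of times it
    appears in any single reference, so repeating a matched word does
    not inflate the score.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    return sum(min(count, max_ref_counts[gram])
               for gram, count in hyp_counts.items())
```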

Bleu Detailed
• Calculates n-gram precision p_n for n = 1, 2, …, N (typically 4) by summing n-gram matches over every hypothesis translation in the test set

• Uses a brevity penalty to compensate for the lack of a recall term by penalizing translations that are too short (h = hypothesis length, r = reference length):

BP = \begin{cases} 1 & \text{if } h > r \\ e^{1 - r/h} & \text{if } h \le r \end{cases}

• Bleu is defined as the weighted geometric average of the p_n, scaled by BP:

Bleu = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
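A minimal sketch of how the pieces combine, at the segment level only; real Bleu sums match and length statistics over the whole test set before computing the precisions, and every p_n must be nonzero for the logarithm. Function and variable names are ours:

```python
import math

def bleu_score(precisions, hyp_len, ref_len):
    """Combine n-gram precisions [p1, ..., pN] with the brevity penalty.

    Uses uniform weights w_n = 1/N and assumes every precision is
    nonzero (with a zero precision the geometric mean is zero).
    """
    n = len(precisions)
    # Brevity penalty: no penalty if the hypothesis is longer than the
    # reference, exponential penalty otherwise.
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)
```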


Common Assumptions About Bleu


• Bleu is commonly reported as the sole evidence of improved translation quality in conference papers

• Sometimes failure to improve Bleu is taken as failure to improve translation quality (see “Word Sense Disambiguation v. SMT”)

• This relies on two key assumptions:

– That Bleu accurately accounts for allowable variation in translation
– That Bleu correlates with human judgments


Are These Assumptions Valid?


• Does an improvement in Bleu score guarantee a genuine translation
improvement?

• Does a failure to improve Bleu always mean that translation quality has not
improved?


Not Always
• We show that in some cases a higher Bleu score is neither sufficient nor
necessary to ensure genuine translation improvement.

• We do this in two ways


1. By showing that Bleu has a poor model of allowable variation and fails to distinguish between translations of differing quality
2. By showing two significant counterexamples to Bleu’s correlation with human
judgments


Equally Scoring Translations: Permutations


• Because Bleu does not constrain the order of n-grams, we can construct equally scoring translations by permuting the hypothesis around bigram mismatch points (sketched in code below)

Appeared calm | when | he was | taken | to the American plane | , | which will | to Miami , Florida .

• So this and 40,320 other candidates receive the same score:

which will | he was | , | when | taken | Appeared calm | to the American plane | to Miami , Florida .

• Current systems produce translations with millions of similarly scoring permutations, up to 10^73. Are all of these likely to be judged equally valid?
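A minimal sketch of the construction, assuming whitespace-tokenized input (function names are ours): split the hypothesis wherever a bigram is matched by no reference, then reorder the resulting blocks. Unigram counts are unchanged, and any higher-order n-gram crossing a block boundary was already a mismatch, so, barring accidental new matches at the new boundaries, every reordering receives the same Bleu score.

```python
import itertools

def bigram_mismatch_blocks(hypothesis, references):
    """Split hypothesis tokens at bigrams unmatched by any reference."""
    ref_bigrams = {tuple(ref[i:i + 2])
                   for ref in references for i in range(len(ref) - 1)}
    blocks, current = [], [hypothesis[0]]
    for prev, tok in zip(hypothesis, hypothesis[1:]):
        if (prev, tok) in ref_bigrams:
            current.append(tok)     # bigram matched: extend current block
        else:
            blocks.append(current)  # mismatch: start a new block
            current = [tok]
    blocks.append(current)
    return blocks

# With k blocks there are k! orderings; the 8 blocks in the example
# above give 8! = 40,320 candidates, all scored identically by Bleu:
# for perm in itertools.permutations(blocks):
#     candidate = [tok for block in perm for tok in block]
```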


Equally Scoring Translations: Substitutions


• Different items may be drawn from references and receive the same score

was being led to the | calm as he was | would take | carry him | seemed
quite | when | taken

• Unmatched words (when, taken) can be replaced by anything (black, helicopters), as sketched below

• Bleu’s model of allowable variation in translation is insufficient to distinguish between translations of differing quality

• Bleu cannot be guaranteed to correlate with human judgments
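A tiny illustration of the substitution point, reusing the hypothesis from the earlier example and assuming, as that slide states, that “when” and “taken” match no reference n-gram:

```python
# Replacing the unmatched words leaves every n-gram match count and the
# hypothesis length unchanged, so the Bleu score cannot change either.
hyp = ("Appeared calm when he was taken to the American plane , "
       "which will to Miami , Florida .").split()
swap = {"when": "black", "taken": "helicopters"}  # unmatched -> arbitrary
variant = [swap.get(tok, tok) for tok in hyp]
# `hyp` and `variant` receive identical Bleu scores against the
# references above, despite `variant` being nonsense.
```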


Failures in Practice
• Criticism: these are constructed examples; Bleu assumes a cooperative environment

• These failures happen in practice too


– In the 2005 NIST MT Eval, the system ranked 6th by Bleu scored 1st in the manual human evaluation
– Bleu incorrectly ranks a poor phrase-based MT system higher than a good rule-based system


NIST 2005 Results


[Scatter plot: Human Score (adequacy, 2–4) vs. Bleu Score (0.38–0.52) for the 2005 NIST MT Eval systems, with a fitted correlation line.]



NIST 2005 Results


[Scatter plot: Human Score (fluency, 2–4) vs. Bleu Score (0.38–0.52) for the 2005 NIST MT Eval systems, with a fitted correlation line.]


Example
Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs.

Hypothesis 1: Iran has already stated that Kharazi’s statements to the conference because of the Jordanian King Abdullah II in which he stood accused Iran of interfering in Iraqi affairs.
n-gram matches: 27 unigrams, 20 bigrams, 15 trigrams, 10 4-grams
Human scores: Adequacy: 3, 2; Fluency: 3, 2

Hypothesis 2: Iran already announced that Kharrazi will not attend the conference because of the statements made by the Jordanian Monarch Abdullah II who has accused Iran of interfering in Iraqi affairs.
n-gram matches: 24 unigrams, 19 bigrams, 15 trigrams, 12 4-grams
Human scores: Adequacy: 5, 4; Fluency: 5, 4

Despite similar n-gram match counts, and hence similar Bleu scores, the two hypotheses receive very different human judgments.


Systran v. SMT
[Scatter plot: Human Score (adequacy and fluency, 2–4.5) vs. Bleu Score (0.18–0.30) for Systran, SMT System 1 (full training set), and SMT System 2 (small training set). Humans judge Systran comparable to or better than SMT System 1, despite its much lower Bleu score.]


Implications for Research


• Higher Bleu score does not guarantee genuine improvement in translation
quality

• It is therefore inappropriate and insufficient to:


– Run workshops to compare systems using Bleu alone
– Compare systems which employ heterogeneous strategies using Bleu
– Report translation improvements in conference papers without examples and
manual verification
– Dismiss research which fails to improve Bleu as not improving translation
quality


Conclusions
• We have shown:
– Increasing Bleu is not sufficient to guarantee a genuine improvement
– Increasing Bleu is not necessary for a genuine improvement

• This breaks the fundamental assumption that Bleu correlates with human judgments

• It implies that the current methodology for evaluating MT research is flawed

• We must develop a new evaluation methodology


Thank you!


What Should We Do Instead?


• Human evaluation

• Careful experimental design with clear, testable hypothesis

• Focused manual evaluation to test whether the hypothesis holds

• Show examples in papers

• Publish all translations online


When Can We Use Bleu?


• To compare different versions during system development

• As an objective function for minimum error rate training

• As a “sanity check” prior to doing human evaluation


Other Known Deficiencies of Bleu


• Scores are hard to interpret

• Different numbers of references lead to radically different scores

• Does not work at the per-sentence level

• Content-bearing words are given no more weight than function words
