April 7, 2006
Callison-Burch, Osborne and Koehn Re-evaluating the Role of Bleu April 7, 2006
Talk Overview
• How do we currently evaluate MT research?
Conducting Research in MT
• Posit theory of how to improve translation quality
Determining Goodness
• To determine if translation improved, we need to measure translation quality
Enter: Bleu
• “Bi-Lingual Evaluation Understudy”
Bleu Detailed
References:
• Rodriguez seemed quite calm as he was being led to the American plane that would take him to Miami in Florida .
• Rodriguez appeared calm as he was being led to the American plane that was to carry him to Miami in Florida .
• Rodriguez appeared calm as he was led to the American plane which will take him to Miami , Florida .
• Rodriguez appeared calm while being escorted to the plane that would take him to Miami , Florida .
Hypothesis:
Appeared calm when he was taken to the American plane , which will to Miami , Florida .
Matches:
1-grams: 15
2-grams: 10
3-grams: 5
4-grams: 3
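The match counts for this example can be checked with a short sketch. It assumes lowercased, whitespace-split tokens and clips each hypothesis n-gram's count at the largest count found in any single reference. The helper names (tokenize, ngrams, clipped_matches) are illustrative and not taken from the talk.

```python
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase and split on whitespace (the slide text is pre-tokenized).
    return sentence.lower().split()


def ngrams(tokens, n):
    # All contiguous n-grams of the token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_matches(hypothesis, references, n):
    # Count hypothesis n-grams that occur in a reference, clipping each
    # n-gram's count at its maximum count in any single reference.
    hyp_counts = Counter(ngrams(hypothesis, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    return sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())


references = [tokenize(s) for s in [
    "Rodriguez seemed quite calm as he was being led to the American plane "
    "that would take him to Miami in Florida .",
    "Rodriguez appeared calm as he was being led to the American plane "
    "that was to carry him to Miami in Florida .",
    "Rodriguez appeared calm as he was led to the American plane "
    "which will take him to Miami , Florida .",
    "Rodriguez appeared calm while being escorted to the plane "
    "that would take him to Miami , Florida .",
]]
hypothesis = tokenize(
    "Appeared calm when he was taken to the American plane , "
    "which will to Miami , Florida ."
)

match_counts = {n: clipped_matches(hypothesis, references, n) for n in (1, 2, 3, 4)}
totals = {n: len(ngrams(hypothesis, n)) for n in (1, 2, 3, 4)}

for n in (1, 2, 3, 4):
    print(f"{n}-grams: {match_counts[n]} of {totals[n]} matched")
```

Under these assumptions the sketch counts 15, 10, 5, and 3 matching 1- to 4-grams out of 18, 17, 16, and 15 n-grams in the hypothesis. The unigram count depends on both lowercasing ("Appeared" matches "appeared") and clipping (only one of the two commas is credited).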
Bleu Detailed
• Calculates n-gram precision p_n for n = 1, 2, 3, 4, ... by summing n-gram
matches over every hypothesis translation in the test set
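A minimal sketch of this pooling step follows. It assumes pre-tokenized input and clipped matching, and it sums matches and n-gram totals over the whole test set before dividing. The function names and the two toy sentence pairs are invented for illustration and do not come from the talk.

```python
from collections import Counter


def ngram_counts(tokens, n):
    # Multiset of contiguous n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_precisions(hypotheses, reference_sets, max_n=4):
    # p_n = (clipped matches summed over all hypotheses)
    #       / (hypothesis n-grams summed over all hypotheses)
    matched = [0] * max_n
    total = [0] * max_n
    for hyp, refs in zip(hypotheses, reference_sets):
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            max_ref = Counter()
            for ref in refs:
                max_ref |= ngram_counts(ref, n)  # Counter union keeps the max count
            # Counter intersection keeps the min count, i.e. the clipped matches.
            matched[n - 1] += sum((hyp_counts & max_ref).values())
            total[n - 1] += sum(hyp_counts.values())
    return [m / t if t else 0.0 for m, t in zip(matched, total)]


# Hypothetical two-sentence test set, one reference per sentence.
hypotheses = ["the cat sat on the mat".split(), "a dog".split()]
reference_sets = [["the cat sat on the mat".split()], ["the dog barked".split()]]

precisions = corpus_precisions(hypotheses, reference_sets, max_n=2)
print(precisions)
```

On this toy set the pooled unigram precision is 7/8 = 0.875, whereas averaging the two per-sentence precisions (6/6 and 1/2) would give 0.75. Pooling weights longer hypotheses more heavily.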
• Does a failure to improve Bleu always mean that translation quality has not
improved?
Not Always
• We show that in some cases a higher Bleu score is neither sufficient nor
necessary to ensure genuine translation improvement.
was being led to the | calm as he was | would take | carry him | seemed
quite | when | taken
Failures in Practice
• Criticism: those are constructed examples; Bleu assumes a cooperative
environment
[Figure: Human Score plotted against Bleu Score (x-axis from 0.38 to 0.52)]
Example
Reference: Iran had already announced Kharazi would boycott the conference
after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs.
Hypothesis 1: Iran has already stated that Kharazi’s statements to the conference
because of the Jordanian King Abdullah II in which he stood accused Iran of
interfering in Iraqi affairs.
n-gram matches: 27 unigrams, 20 bigrams, 15 trigrams, and ten 4-grams
human scores: Adequacy: 3, 2; Fluency: 3, 2
Hypothesis 2: Iran already announced that Kharrazi will not attend the
conference because of the statements made by the Jordanian Monarch Abdullah
II who has accused Iran of interfering in Iraqi affairs.
n-gram matches: 24 unigrams, 19 bigrams, 15 trigrams, and 12 4-grams
human scores: Adequacy: 5, 4; Fluency: 5, 4
Systran v. SMT
[Figure: Human Score (Adequacy and Fluency) plotted against Bleu Score (x-axis from 0.18 to 0.3) for Systran, SMT System 1 (full training set), and SMT System 2 (small training set)]
Conclusions
• We have shown:
– Increasing Bleu is insufficient to guarantee genuine improvements
– Increasing Bleu is unnecessary to have actual improvements
• Breaks our fundamental assumption that Bleu correlates with human judgments
Thank you!