Improving Evaluation of Machine Translation Quality Estimation
Yvette Graham
ADAPT Centre
School of Computer Science and Statistics
Trinity College Dublin
yvette.graham@scss.tcd.ie
[Figure 1 panels: (a) Gold Labels; (b) System A Predictions (MAE: 0.687); (c) System B Predictions (MAE: 0.603).]
Figure 1: WMT-14 English-to-German quality estimation Task 1.1, where mismatched prediction/gold label distributions achieve apparently better performance: (a) gold label distribution; (b) an example system disadvantaged by its discrete predictions; (c) an example system gaining an advantage from its continuous predictions.
tion systems, as scores are individually computed per translation using custom post-edited reference translations, avoiding the bias that can occur with metrics that employ generic reference translations. In our later evaluation, we therefore use HTER scores in addition to PERs as suitable gold standard labels.

2.2 Mean Absolute Error

Mean absolute error is likely the most widely applied measure for comparing quality estimation system predictions with gold labels, in addition to being the official measure applied to scoring variants of tasks in WMT (Bojar et al., 2014). MAE is the average absolute difference between a system's predictions and the gold standard labels for translations, and a system achieving a lower MAE is considered a better system. Significant issues arise, however, when quality estimation systems are evaluated with MAE by comparing distributions of predictions and gold labels. Firstly, a system's MAE can be lowered not only by individual predictions closer to corresponding gold labels, but also by prediction of aggregate statistics specific to the distribution of gold labels in the particular test set used for evaluation. MAE is most susceptible in this respect when gold labels have a unimodal distribution with relatively low standard deviation. For example, Figure 2(a) shows the test set gold label HTER distribution for Task 1.2 in WMT-14, where the bulk of HTERs are located around one main peak with relatively low variance in the distribution. Unfortunately, with MAE, a system that correctly predicts the location of the mode of the test set gold distribution and centers predictions around it with an optimally conservative variance can achieve a lower MAE and apparently better performance. Figure 2(b) shows that a lower MAE can be achieved by rescaling the original prediction distribution of an example system to a distribution with lower variance.

A disadvantage of the ability to gain performance by predicting such features of a given test set is that prediction of aggregates is, in general, far easier than individual prediction. In addition, inclusion of confounding test set aggregates such as these in evaluations will likely lead to both an overestimate of the ability of some systems to predict the quality of unseen translations and an underestimate of the accuracy of systems that courageously attempt to predict the quality of translations in the tails of gold distributions; it follows that systems optimized for MAE can be expected to perform badly when predicting the quality of translations in the tails of gold label distributions (Moreau and Vogel, 2014).

Table 2 shows how the MAEs of the original predicted score distributions for all systems participating in Task 1.2 of WMT-14 can be reduced by shifting and rescaling the prediction score distributions according to gold label aggregates. Table 3 shows that, for similar reasons, other measures commonly applied to evaluation of quality estimation systems that are also not unit-free, such as root mean squared error (RMSE), encounter the same problem.

2.3 Significance Testing

In quality estimation, it is common to apply bootstrap resampling to assess the likelihood that a decrease in MAE (an improvement) has occurred by chance. In contrast to other areas of MT, where the accuracy of randomized methods of signifi-
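The vulnerability of MAE described in Section 2.2 is easy to reproduce synthetically. The sketch below (our own illustration; the labels and systems are hypothetical, not WMT submissions) compares an informative but noisy predictor against a degenerate one that simply predicts the test set gold mean with near-zero variance: the degenerate system wins on MAE while the Pearson correlation exposes its lack of information about individual translations.

```python
import math
import random

def mae(preds, gold):
    """Mean absolute error between predictions and gold labels."""
    return sum(abs(p - g) for p, g in zip(preds, gold)) / len(gold)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

random.seed(7)
# Hypothetical gold HTER labels: unimodal with low variance,
# as in the Figure 2(a) distribution.
gold = [min(max(random.gauss(0.25, 0.10), 0.0), 1.0) for _ in range(1000)]

# An informative system: tracks each individual gold label, but noisily.
informative = [g + random.gauss(0.0, 0.15) for g in gold]

# A degenerate system: ignores each translation entirely and predicts the
# test set gold mean with optimally conservative (near-zero) variance.
gold_mean = sum(gold) / len(gold)
degenerate = [gold_mean + random.gauss(0.0, 0.01) for _ in gold]

print(f"informative  MAE={mae(informative, gold):.3f}  r={pearson(informative, gold):.3f}")
print(f"degenerate   MAE={mae(degenerate, gold):.3f}  r={pearson(degenerate, gold):.3f}")
```

On data of this shape the degenerate predictor achieves the lower (apparently better) MAE despite predicting a test set aggregate rather than translation quality, while its correlation with the gold labels is near zero.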
[Figure 2: Multilizer predictions for WMT-14 Task 1.2: (a) original predictions (MAE: 0.1504) against the gold HTER distribution; (b) rescaled predictions against the gold distribution.]

Table 3: Original and rescaled RMSE of prediction score distributions for WMT-14 Task 1.2 systems:

System              Original RMSE   Rescaled RMSE
FBK-UPV-UEDIN-wp    0.167           0.166
DCU-rtm-svr         0.167           0.165
DCU-rtm-tree        0.175           0.169
DFKI-svr            0.177           0.171
USHEFF              0.178           0.178

[Figure 3 panels: baseline predictions (MAE: 0.148, r: 0.451) and CMU-ISL-FULL predictions (MAE: 0.152, r: 0.494), each plotted against gold HTER as densities, raw scores, and standardized (z) scores.]
Figure 3: WMT-13 Task 1.1 systems, showing the baseline with a better MAE than CMU-ISL-FULL only due to conservative variance in its prediction distribution, and despite its weaker correlation with gold labels.
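The standardization shown in Figure 3 can be checked directly: replacing raw scores with z-scores (a shift and rescale of an entire distribution) leaves the Pearson correlation unchanged, whereas MAE moves with any such shift or rescale. A minimal sketch on synthetic data (all values hypothetical):

```python
import math
import random

def mae(preds, gold):
    """Mean absolute error between predictions and gold labels."""
    return sum(abs(p - g) for p, g in zip(preds, gold)) / len(gold)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def zscores(x):
    """Standardize: numbers of standard deviations from the mean."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]

random.seed(3)
gold = [random.gauss(0.3, 0.15) for _ in range(500)]
preds = [g + random.gauss(0.0, 0.1) for g in gold]

# Pearson is invariant to rescaling/shifting the whole prediction
# distribution, and standardizing both variables changes nothing ...
shifted = [0.5 * p + 1.0 for p in preds]
assert abs(pearson(shifted, gold) - pearson(preds, gold)) < 1e-9
assert abs(pearson(zscores(preds), zscores(gold)) - pearson(preds, gold)) < 1e-9

# ... while MAE is not unit-free: the shifted system looks far worse
# on MAE even though its per-translation information is identical.
print(mae(preds, gold), mae(shifted, gold))
```

The same invariance is what makes correlations comparable across gold label representations, as discussed in the text.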
prediction score distribution, as can be seen by the narrow blue spike in Figure 3(b). Figure 3(e) shows how the prediction distribution of CMU-ISL-FULL, on the other hand, has higher variance, and subsequently a higher MAE. Figures 3(c) and 3(f) depict what occurs in the computation of the Pearson correlation, where raw prediction and gold label scores are replaced by standardized scores, i.e. numbers of standard deviations from the mean of each distribution: CMU-ISL-FULL in fact achieves a significantly higher correlation than the baseline system at p < 0.001.

An additional advantage of the Pearson correlation is that coefficients do not change depending on the representation used in the gold standard in the way they do with MAE, making possible a comparison of performance across evaluations that employ different gold label representations. Additionally, there is no longer a need for training and test representations to directly correspond to one another. To demonstrate, in our later evaluation we include the evaluation of systems trained on both HTER and PETs for prediction of both HTER and PERs.

Finally, when evaluated with the Pearson correlation, significance tests can be applied without resorting to randomized methods, in addition to taking into account the dependent nature of the data used in evaluations.

The fact that the Pearson correlation is invariant to separate shifts in location and scale of either of the two variables is nonproblematic for evaluation of quality estimation systems. Take, for instance, the possible counter-argument: a pair of systems, one of which predicts the precise gold distribution, and another predicting the gold distribution + 1, would unfairly receive the same Pearson correlation coefficient. Firstly, it is just as difficult to predict the gold distribution + 1 as it is to predict the gold distribution itself. More importantly, however, the scenario is extremely unlikely to occur in practice: it is highly unlikely that a system would ever accurately predict the gold distribution + 1, as opposed to the actual gold distribution, unless training labels were adjusted in the same manner, or indeed predict the gold distribution shifted or rescaled by any other constant value. It is important to understand that invariance of the Pearson correlation to a shift in location or scale means that the measure is only invariant to a shift in location or scale applied to the entire distribution (of either of the two variables), such as the shift in location and scale that can be used to boost the apparent performance of systems when measures like MAE and RMSE, which are not unit-free, are employed. Increasing the distance between system predictions and gold labels for anything less than the entire distribution, a more realistic scenario, or by something other than a constant across the entire distribution, will result in an appropriately weaker Pearson correlation.

4 Quality Estimation Significance Testing

Previous work has shown the suitability of the Williams significance test (Williams, 1959) for evaluation of automatic MT metrics (Graham and Baldwin, 2014; Graham et al., 2015), and, for similar reasons, the Williams test is appropriate for significance testing differences in performance of competing quality estimation systems.

Significance tests of a difference in correlation are most commonly applied to independent data sets. In such cases, the Fisher r to z transformation is applied to test for significant differences in correlations. Data used for evaluation of quality estimation systems are not independent, however, and this means that if r(P_base, G) and r(P_new, G) are both > 0, the correlation between both sets of predictions themselves, r(P_base, P_new), must also be > 0. The strength of this correlation, directly between the predictions of pairs of quality estimation systems, should be taken into account using a significance test of the difference in correlation between r(P_base, G) and r(P_new, G).

The Williams test (Williams, 1959) evaluates the significance of a difference in dependent correlations (Steiger, 1980). It is formulated as follows, as a test of whether the population correlation between X1 and X3 equals the population correlation between X2 and X3:

t(n-3) = \frac{(r_{13} - r_{23})\sqrt{(n-1)(1+r_{12})}}{\sqrt{2K\frac{(n-1)}{(n-3)} + \frac{(r_{23}+r_{13})^2}{4}(1-r_{12})^3}}

where K = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2\,r_{12}\,r_{13}\,r_{23}.
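The formula translates directly into code. A sketch in Python (the function name is ours), taking X1 and X2 as the two systems' predictions and X3 as the gold labels:

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams test (Williams, 1959) t statistic, with n - 3 degrees of
    freedom, for a difference between two dependent correlations:
      r13 = corr(system 1 predictions, gold labels)
      r23 = corr(system 2 predictions, gold labels)
      r12 = corr(system 1 predictions, system 2 predictions)
      n   = number of translations in the test set."""
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    numerator = (r13 - r23) * math.sqrt((n - 1) * (1 + r12))
    denominator = math.sqrt(
        2 * k * (n - 1) / (n - 3)
        + ((r23 + r13) ** 2 / 4) * (1 - r12) ** 3
    )
    return numerator / denominator

# Hypothetical example: two systems whose predictions correlate at 0.8
# with each other and at 0.50 vs. 0.45 with the gold labels, n = 500.
t = williams_t(0.8, 0.50, 0.45, 500)
```

The statistic is antisymmetric in the two systems (swapping r13 and r23 flips its sign), and a one-sided p-value follows from the t distribution with n − 3 degrees of freedom (e.g. `scipy.stats.t.sf(t, n - 3)`).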
and to demonstrate the ability, made possible by the unit-free Pearson correlation, to evaluate systems trained on a representation distinct from that of the gold labels, we also include evaluation of systems originally trained on PET labels to predict HTER scores. Since PET systems also produce predictions in the form of PET, we convert predictions for all systems to PERs prior to computation of correlations, as PERs more closely correspond to translation quality. Results reveal that systems originally trained on PETs in general perform worse than HTER-trained systems, which is not all that surprising considering that the training representation did not correspond well to translation quality. Again, system rankings diverge from MAE rankings, with the second-best system according to MAE moved to the initial position.

Table 5: Pearson correlation and MAE of system HTER predictions and gold labels for English-to-Spanish WMT-14 Task 1.2 and 1.3 systems trained on either HTER or PET labelled data.

Table 6 shows Pearson correlations for predictions of PER for systems trained on either PETs or HTER, with predictions of systems trained on PETs converted to PER for evaluation. System rankings diverge most for this data set from the original rankings by MAE, as the system holding the initial position according to MAE moves to position 13 according to the Pearson correlation.

Training Labels   QE System              r      MAE
HTER              FBK-UPV-UEDIN-wp       0.529  −
PET               FBK-UPV-UEDIN-wp       0.472  0.972
HTER              FBK-UPV-UEDIN-nowp     0.452  −
HTER              USHEFF                 0.444  −
HTER              DCU-rtm-svr            0.444  −
HTER              DCU-rtm-tree           0.442  −
HTER              SHEFF-lite-sparse      0.441  −
PET               DCU-rtm-rr             0.430  0.932
PET               FBK-UPV-UEDIN-nowp     0.423  1.012
HTER              DFKI-svr               0.412  −
PET               USHEFF                 0.394  1.358
PET               baseline               0.394  1.359
PET               DCU-rtm-svr            0.365  0.915
HTER              Multilizer             0.361  −
PET               SHEFF-lite-sparse      0.337  0.951
PET               SHEFF-lite             0.323  0.940
PET               Multilizer-1           0.288  0.993
HTER              baseline               0.286  −
HTER              DFKI-svr-xdata         0.277  −
PET               Multilizer-2           0.271  0.972
HTER              SHEFF-lite             0.011  −

Table 6: Pearson correlation of system PER predictions and gold labels for English-to-Spanish WMT-14 Task 1.2 and 1.3 systems trained on either HTER or PET labelled data; mean absolute errors (MAE) provided are in seconds per word.

Many of the differences in correlation between systems in Tables 4, 5 and 6 are small, and instead of assuming that an increase in correlation of one system over another corresponds to an improvement in performance, we first apply significance testing to the differences in correlation with gold labels that exist for each pair
Figure 4: Pearson correlation between prediction scores for all pairs of systems participating in WMT-14 Task 1.2.

Figure 5: HTER prediction significance test outcomes for all pairs of systems from English-to-Spanish WMT-13 Task 1.1; colored cells denote a significant increase in correlation with gold labels for the row i system over the column j system.
of systems.

5.1 Significance Tests

When an increase in correlation with gold labels is present for a pair of systems, significance tests provide insight into the likelihood that such an increase has occurred by chance. As described in detail in Section 4, the Williams test (Williams, 1959), a test also appropriate for MT metrics evaluated by the Pearson correlation (Graham and Baldwin, 2014), is appropriate for testing the significance of a difference in dependent correlations, and therefore provides a suitable method of significance testing for quality estimation systems. Figure 4 provides an example of the strength of correlations that commonly exist between the predictions of quality estimation systems.

Figure 5 shows significance test outcomes of the Williams test for systems originally taking part in WMT-13 Task 1.1, with systems ordered from strongest to weakest Pearson correlation with gold labels: a green cell in (row i, column j) signifies a significant win for the row i system over the column j system, and darker shades of green signify conclusions made with more certainty. Test outcomes allow identification of significant increases in correlation with gold labels of one system over another, and subsequently of the systems shown to outperform others. Test outcomes in Figure 5 reveal the substantially increased conclusivity in system rankings made possible with the application of the Pearson correlation and Williams test, with an almost unambiguous total-order ranking of systems and an outright winner of the task.

Figure 6 shows outcomes of Williams significance tests for prediction of HTER, and Figure 7 shows outcomes of tests for PER prediction, for WMT-14 English-to-Spanish, again showing substantially increased conclusivity in system rankings for the tasks.

It is important to note that the number of competing systems a system significantly outperforms should not be used as the criterion for ranking competing quality estimation systems, since the power of the Williams test changes depending on the degree to which the predictions of a pair of systems correlate with each other. A system whose predictions happen to correlate strongly with the predictions of many other systems would be at an unfair advantage were numbers of significant wins used to rank systems. For this reason, it is best to interpret pairwise system tests in isolation.
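To make the pairwise procedure concrete, the sketch below (our own illustration on synthetic predictions, not WMT data) computes the three correlations a Williams test needs, r(P_new, G), r(P_base, G), and, crucially, the dependence r(P_new, P_base), and then the t statistic:

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def williams_t(r12, r13, r23, n):
    """Williams (1959) t statistic with n - 3 degrees of freedom."""
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    return ((r13 - r23) * math.sqrt((n - 1) * (1 + r12))
            / math.sqrt(2 * k * (n - 1) / (n - 3)
                        + ((r23 + r13) ** 2 / 4) * (1 - r12) ** 3))

random.seed(1)
n = 600
gold = [random.gauss(0.3, 0.15) for _ in range(n)]
# Two hypothetical QE systems sharing part of their error, so their
# predictions correlate with each other as well as with the gold labels.
shared = [random.gauss(0.0, 0.10) for _ in range(n)]
p_base = [g + s + random.gauss(0.0, 0.10) for g, s in zip(gold, shared)]
p_new = [g + s + random.gauss(0.0, 0.05) for g, s in zip(gold, shared)]

r13 = pearson(p_new, gold)    # r(P_new, G)
r23 = pearson(p_base, gold)   # r(P_base, G)
r12 = pearson(p_new, p_base)  # dependence between the two systems
t = williams_t(r12, r13, r23, n)
print(f"r(new,G)={r13:.3f}  r(base,G)={r23:.3f}  r(new,base)={r12:.3f}  t={t:.2f}")
```

Because r12 is large, the denominator shrinks and the test correctly gains power over an independent-samples test such as the Fisher r to z comparison; running each pairwise comparison this way yields a significance matrix like the one in Figure 5.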
6 Conclusion
We have provided a critique of current widely used methods of evaluation of quality estimation systems and highlighted potential flaws in existing methods with respect to the ability to boost scores by prediction of aggregate statistics specific to the particular test set in use, or by conservative variance in prediction distributions. We provide an alternate mechanism: since the Pearson correlation is a unit-free measure, it can be applied to evaluation of quality estimation systems, avoiding the previous vulnerabilities of measures such as MAE and RMSE.

[Figure 6: Williams significance test outcomes for HTER prediction, WMT-14 English-to-Spanish; systems labelled by training data (HTER or PET).]