Due to the wealth of testing conditions, a simple overall view of the official MATR08 results released by NIST is difficult. To facilitate this analysis, we examined the average rank of each metric across all conditions, where the rank was determined by its Pearson and Spearman correlations with human judgments. To incorporate statistical significance, we calculated the 95% confidence interval for each correlation coefficient and found the highest and lowest rank from which the correlation coefficient was statistically indistinguishable, resulting in lower and upper bounds of the rank for each metric in each condition. The average lower bound, actual, and upper bound ranks (where a rank of 1 indicates the highest correlation) of the top metrics, as well as BLEU and TER, are shown in Table 4, sorted by the average upper bound Pearson correlation. Full descriptions of the other metrics[3], the evaluation results, and the test set composition are available from NIST (Przybocki et al., 2008).

[3] System descriptions of the metrics are also distributed by AMTA: http://www.amtaweb.org/AMTA2008.html
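To make the rank-bounding procedure concrete, the sketch below computes, for one test condition, each metric's actual rank together with the best and worst ranks from which its correlation is statistically indistinguishable. The Fisher z-transform interval and the overlapping-interval criterion are standard choices we assume here purely for illustration; the official analysis may differ in detail, and the example correlations are made up.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """95% confidence interval for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)                 # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)       # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def rank_bounds(correlations, n):
    """Actual rank of each metric plus the best and worst ranks from which its
    correlation is statistically indistinguishable, judging indistinguishability
    by overlapping 95% confidence intervals (a simplification)."""
    cis = {m: fisher_ci(r, n) for m, r in correlations.items()}
    ordered = sorted(correlations, key=correlations.get, reverse=True)
    bounds = {}
    for rank, m in enumerate(ordered, 1):
        lo, hi = cis[m]
        best = 1 + sum(1 for o in ordered if cis[o][0] > hi)             # CIs entirely above
        worst = len(ordered) - sum(1 for o in ordered if cis[o][1] < lo)  # CIs entirely below
        bounds[m] = (best, rank, worst)
    return bounds

# Hypothetical segment-level correlations for one condition (n judged segments):
print(rank_bounds({"TERp": 0.74, "METEOR": 0.73, "BLEU": 0.61}, n=2000))
# -> {'TERp': (1, 1, 2), 'METEOR': (1, 2, 2), 'BLEU': (3, 3, 3)}
```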
This analysis shows that TERp was consistently one of the top metrics across test conditions and had the highest average rank both in terms of Pearson and Spearman correlations. While this analysis is not comprehensive, it does give a general idea of the performance of all metrics by synthesizing the results into a single table. There are striking differences between the Spearman and Pearson correlations for other metrics; in particular, the CDER metric (Leusch et al., 2006) had the second highest rank in Spearman correlations (after TERp), but was the sixth ranked metric according to the Pearson correlation. In several cases, TERp was not the best metric (if a metric was the best in all conditions, its average rank would be 1), although it performed well on average. In particular, TERp did significantly better than the TER metric, indicating the benefit of the enhancements made to TER.
Metric       Optimization Set          Test Set                  Optimization+Test
             Seg     Doc     Sys       Seg     Doc     Sys       Seg     Doc     Sys
BLEU         0.635   0.816   0.714†    0.550   0.740   0.690†    0.606   0.794   0.738†
BLEU-2       0.643   0.823   0.786†    0.558   0.747   0.690†    0.614   0.799   0.738†
METEOR       0.729   0.886   0.881     0.727   0.853   0.738†    0.730   0.876   0.922
TER         -0.630  -0.794  -0.810†   -0.630  -0.797  -0.667†   -0.631  -0.801  -0.786†
TERp        -0.760  -0.834  -0.976    -0.737  -0.818  -0.881    -0.754  -0.834  -0.929

Table 3: MT06 Dev. Optimization & Test Set Spearman Correlation Results

Table 4: Average Metric Rank in NIST Metrics MATR 2008 Official Results
4 Paraphrases

TERp uses probabilistic phrasal substitutions to align phrases in the hypothesis with phrases in the reference. It does so by looking up, in a precomputed phrase table, paraphrases of phrases in the reference, and using the associated edit cost as the cost of performing a match against the hypothesis. The paraphrases used in TERp were extracted using the pivot-based method described in (Bannard and Callison-Burch, 2005), with several additional filtering mechanisms to increase precision. The pivot-based method exploits the monolingual semantic knowledge inherent in bilingual corpora: we first identify English-to-F phrasal correspondences, then map from English to English by following translation units from English to F and back. For example, if the two English phrases e1 and e2 both correspond to the same foreign phrase f, then they may be considered paraphrases of each other with the following probability:

p(e1|e2) ≈ p(e1|f) ∗ p(f|e2)

If there are several pivot phrases that link the two English phrases, then they are all used in computing the probability:

p(e1|e2) ≈ Σ_{f′} p(e1|f′) ∗ p(f′|e2)
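As a minimal sketch of this computation, the function below derives pivot paraphrase probabilities from the two directions of a phrase table. The dictionary representation and the toy numbers are our own illustration, not the actual tables used for TERp.

```python
from collections import defaultdict

def pivot_paraphrase_probs(e2f, f2e):
    """p(e1|e2) = sum over pivot phrases f' of p(e1|f') * p(f'|e2),
    following the pivot method of Bannard and Callison-Burch (2005).
    e2f: English phrase -> {foreign phrase: p(f|e)}
    f2e: foreign phrase -> {English phrase: p(e|f)}"""
    probs = defaultdict(lambda: defaultdict(float))
    for e2, pivots in e2f.items():
        for f, p_f_given_e2 in pivots.items():
            for e1, p_e1_given_f in f2e.get(f, {}).items():
                if e1 != e2:  # skip the trivial self-paraphrase
                    probs[e2][e1] += p_e1_given_f * p_f_given_e2
    return probs

# Toy phrase-table fragment with made-up probabilities:
e2f = {"brief": {"f1": 0.7, "f2": 0.3}, "short": {"f1": 0.6, "f2": 0.4}}
f2e = {"f1": {"brief": 0.5, "short": 0.4}, "f2": {"brief": 0.9, "short": 0.1}}
print(dict(pivot_paraphrase_probs(e2f, f2e)["brief"]))  # {'short': 0.31}
```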
The corpus used for extraction was an Arabic-English newswire bitext containing a million sentences. A few examples of the extracted paraphrase pairs that were actually used in a run of TERp on the Metrics MATR 2008 development set are shown below:

(brief → short)
(controversy over → polemic about)
(by using power → by force)
(response → reaction)

A discussion of paraphrase quality is presented in Section 4.1, followed in Section 4.2 by a brief analysis of the effect that varying the pivot corpus used for automatic paraphrase generation has on the correlation performance of the TERp metric.
4.1 Analysis of Paraphrase Quality

We analyzed the utility of the paraphrase probability and found that it was not always a very reliable estimate of the degree to which the pair was semantically related. For example, we looked at all paraphrase pairs that had probabilities greater than 0.9, a set that should ideally contain pairs that are paraphrastic to a large degree. In our analysis, we found the following five kinds of paraphrases in this set:

(a) Lexical Paraphrases. These paraphrase pairs are not phrasal paraphrases but instead differ in at most one word, and may be considered lexical paraphrases for all practical purposes. While these pairs may not be very valuable for TERp due to the obvious overlap with WordNet, they may help in increasing the coverage of the paraphrastic phenomena that TERp can handle. Here are some examples:

(2500 polish troops → 2500 polish soldiers)
(accounting firms → auditing firms)
(armed source → military source)

(b) Morphological Variants. These phrasal pairs differ only in the morphological form of one of the words. As the examples show, any knowledge that these pairs may provide is already available to TERp via stemming.

(50 ton → 50 tons)
(caused clouds → causing clouds)
(syria deny → syria denies)

(c) Approximate Phrasal Paraphrases. This set included pairs that shared only partial semantic content. Most paraphrases extracted by the pivot method are expected to be of this nature. These pairs are not directly beneficial to TERp since they cannot be substituted for each other in all contexts. However, the fact that they share at least some semantic content does suggest that they may not be entirely useless either. Examples include:

(mutual proposal → suggest)
(them were exiled → them abroad)
(my parents → my father)

(d) Phrasal Paraphrases. We did indeed find a large number of pairs in this set that were truly paraphrastic; these proved the most useful for TERp. For example:

(agence presse → news agency)
(army roadblock → military barrier)
(staff walked out → team withdrew)

(e) Noisy Co-occurrences. There are also pairs that are completely unrelated and happen to be extracted as paraphrases because of the noise inherent in the pivoting process. These pairs are much smaller in number than the four sets described above and are not significantly detrimental to TERp, since they are rarely chosen for phrasal substitution. Examples:

(counterpart salam → peace)
(regulation dealing → list)
(recall one → deported)

Given this distribution of the pivot-based paraphrases, we experimented with a variant of TERp that did not use the paraphrase probability at all, but instead used only the actual edit distance between the two phrases to determine the final cost of a phrase substitution. The results for this experiment are shown in the second row of Table 5, where we can see that this variant works as well as the full version of TERp that utilizes paraphrase probabilities. This confirms our intuition that the probability computed via the pivot method is not a very useful predictor of semantic equivalence for use in TERp.
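As a minimal sketch of this no-probability variant, the function below prices a phrase substitution by the word-level Levenshtein distance between the two phrases; the per-edit weight is a made-up stand-in for the tuned edit costs in Table 1.

```python
def substitution_cost(ref_phrase, paraphrase, per_edit_cost=0.3):
    """Phrase-substitution cost for the no-probability TERp variant:
    proportional to the word-level edit distance between the phrases,
    ignoring the pivot paraphrase probability. per_edit_cost is a
    hypothetical weight, not a value from the paper."""
    a, b = ref_phrase.split(), paraphrase.split()
    prev = list(range(len(b) + 1))          # Levenshtein DP over words
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete wa
                           cur[j - 1] + 1,              # insert wb
                           prev[j - 1] + (wa != wb)))   # substitute wa -> wb
        prev = cur
    return per_edit_cost * prev[-1]

print(substitution_cost("army roadblock", "military barrier"))  # 0.6 (2 edits)
```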
4.2 Varying Paraphrase Pivot Corpora

To determine the effect that the pivot language might have on the quality and utility of the extracted paraphrases in TERp, we used paraphrase pairs made available by Callison-Burch (2008). These pairs were extracted from Europarl data using each of 10 European languages (German, Italian, French, etc.) as a pivot language separately, and the extracted paraphrase pairs were then combined. Callison-Burch (2008) also extracted, and made available, syntactically constrained paraphrase pairs from the same data, which are more likely to be semantically related.

We used both sets of paraphrases in TERp as alternatives to the paraphrase pairs that we extracted from the Arabic newswire bitext. The results, shown in the last four rows of Table 5, indicate that using a pivot language other than the one the MT system is actually translating yields results that are almost as good. They also show that the syntactic constraints imposed by Callison-Burch (2008) on the pivot-based paraphrase extraction process are useful and yield improved results over the baseline pivot method. The results further support our claim that the pivot paraphrase probability is not a very useful indicator of semantic relatedness.

5 Varying Human Judgments

To evaluate the differences between human judgment types, we first align the hypothesis to the references using a fixed set of edit costs, identical to the weights in Table 1, and then optimize the edit costs to maximize the correlation, without realigning. Separating the edit costs used for alignment from those used for scoring allows us to keep the costs selected for alignment purposes distinct from those selected to increase correlation.
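The key property of this setup is that, once the alignments are fixed, each segment reduces to a vector of per-edit-type counts, so rescoring under new costs is just a dot product and no realignment is ever needed. The sketch below illustrates that idea with a simple random search; the paper does not specify the optimizer used here, so the search strategy and all names in this example are purely illustrative.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / var

def optimize_scoring_costs(edit_counts, human_scores, n_iter=10000, seed=0):
    """Search for scoring edit costs that maximize |correlation| with human
    judgments. Alignments stay fixed: a segment's score under candidate
    costs is the dot product of its edit-type counts with those costs."""
    rng = random.Random(seed)
    dims = len(edit_counts[0])
    best_costs, best_r = None, -1.0
    for _ in range(n_iter):
        costs = [rng.uniform(0.0, 1.0) for _ in range(dims)]
        scores = [sum(c * k for c, k in zip(costs, seg)) for seg in edit_counts]
        r = abs(pearson(scores, human_scores))  # TERp correlations are negative
        if r > best_r:
            best_costs, best_r = costs, r
    return best_costs, best_r
```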
Paraphrase Setup                                      Pearson                   Spearman
                                                      Seg     Doc     Sys       Seg     Doc     Sys
Arabic pivot                                         -0.787  -0.918  -0.985    -0.737  -0.818  -0.881
Arabic pivot and no prob                             -0.787  -0.933  -0.986    -0.737  -0.841  -0.881
Europarl pivot                                       -0.775  -0.940  -0.983    -0.738  -0.865  -0.905
Europarl pivot and no prob                           -0.775  -0.940  -0.983    -0.737  -0.860  -0.905
Europarl pivot and syntactic constraints             -0.781  -0.941  -0.985    -0.739  -0.859  -0.881
Europarl pivot, syntactic constraints and no prob    -0.779  -0.946  -0.985    -0.737  -0.866  -0.976

Table 5: Results on the NIST MATR 2008 test set for several variations of paraphrase usage.
For Adequacy and Fluency judgments, the MTEval 2002 human judgment set[4] was used. This set consists of the output of ten MT systems, 3 Arabic-to-English and 7 Chinese-to-English, comprising a total, across all systems and both language pairs, of 7,452 segments across 900 documents. To evaluate HTER, the GALE (Olive, 2005) 2007 (Phase 2.0) HTER scores were used. This set consists of the output of 6 MT systems, 3 Arabic-to-English and 3 Chinese-to-English, although each of the systems in question is the product of system combination. The HTER data consisted of a total, across all systems and language pairs, of 16,267 segments across a total of 1,568 documents. Because HTER annotation is especially expensive and difficult, it is rarely performed, and the only source, to the authors' knowledge, of available HTER annotations is GALE evaluation data, for which no Fluency and Adequacy judgments have been made publicly available.

[4] Distributed to the authors by request from NIST.

The edit costs learned for each of these human judgments, along with the alignment edit costs, are shown in Table 6. While all three types of human judgments differ from the costs used in alignment, the HTER edit costs differ most significantly. Unlike Adequacy and Fluency, which have a low edit cost for insertions and a very high cost for deletions, HTER has a balanced cost for the two edit types. Inserted words are strongly penalized in HTER, whereas in Adequacy and Fluency such errors are largely forgiven. Stem and synonym edits are also penalized in HTER, while they are considered equivalent to a match for both Adequacy and Fluency. This penalty against stem matches can be attributed to fluency requirements in HTER that specifically penalize incorrect morphology. The cost of shifts is also increased in HTER, strongly penalizing the movement of phrases within the hypothesis, while Adequacy and Fluency give a much lower cost to such errors. Some of the differences between HTER and both Fluency and Adequacy can be attributed to the different systems used: the MT systems evaluated with HTER are all high-performing, state-of-the-art systems, while the systems used for Adequacy and Fluency are older MT systems.

The differences between Adequacy and Fluency are smaller, but still significant. In particular, the cost of shifts is over twice as high for the fluency-optimized system as for the adequacy-optimized system, indicating that the movement of phrases, as expected, is only slightly penalized when judging meaning, but can be much more harmful to the fluency of a translation. Fluency, however, favors paraphrases more strongly than the edit costs optimized for Adequacy do. This might indicate that paraphrases are used to generate a more fluent translation, though at a potential loss of meaning.
6 Discussion

We introduced a new evaluation metric, TER-Plus, and showed that it is competitive with state-of-the-art evaluation metrics when its predictions are correlated with human judgments. The inclusion of stem, synonym, and paraphrase edits allows TERp to overcome some of the weaknesses of the TER metric and better align hypothesized translations with reference translations. These new edit costs can then be optimized to allow better correlation with human judgments. In addition, we have examined the use of other paraphrasing techniques, and shown that the paraphrase probabilities estimated by the pivot method may not be fully adequate for judging whether a paraphrase in a translation indicates a correct translation. This line of research holds promise as an external evaluation method for various paraphrasing methods.

However promising correlation results for an evaluation metric may be, the evaluation of the final output of an MT system is only a portion of the utility of an automatic translation metric. Optimization of the parameters of an MT system is now done using automatic metrics, primarily BLEU. It is likely that some features that make an evaluation metric good for evaluating the final output of a system would make it a poor metric for use in system tuning. In particular, a metric may have difficulty distinguishing between outputs of an MT system that has been optimized for that same metric. BLEU, the metric most frequently used to optimize systems, might therefore perform poorly in evaluation tasks compared to recall-oriented metrics such as METEOR and TERp (whose tuning in Table 1 indicates a preference towards recall). Future research into the use of TERp and other metrics as optimization metrics is needed to better understand these metrics and their interaction with parameter optimization.

Finally, we explored the differences between three types of human judgments that are often used to evaluate both MT systems and automatic metrics, by optimizing TERp to these human judgments and examining the resulting edit costs. While this can make no judgment as to the preference of one type of human judgment over another, it indicates differences between these human judgment types, and in particular the difference between HTER and Adequacy and Fluency. This exploration is limited by the lack of a large amount of diverse data annotated for all human judgment types, as well as by the small number of edit types used by TERp. The inclusion of additional, more specific edit types could lead to a more detailed understanding of which translation phenomena and translation errors are most emphasized or ignored by which types of human judgments.

Acknowledgments

This work was supported, in part, by BBN Technologies under the GALE Program, DARPA/IPTO Contract No. HR0011-06-C-0022, and in part by the Human Language Technology Center of Excellence. TERp is available on the web for download at: http://www.umiacs.umd.edu/~snover/terp/.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pages 597–604, Ann Arbor, Michigan, June.

Chris Callison-Burch. 2008. Syntactic constraints on paraphrases extracted from parallel corpora. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 196–205, Honolulu, Hawaii, October. Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press. http://www.cogsci.princeton.edu/~wn [2000, September 7].

David Kauchak and Regina Barzilay. 2006. Paraphrasing for Automatic Evaluation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 455–462.

Gregor Leusch, Nicola Ueffing, and Hermann Ney. 2006. CDER: Efficient MT Evaluation Using Block Movements. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006).

V. I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady, 10:707–710.
Daniel Lopresti and Andrew Tomkins. 1997. Block edit models for approximate string matching. Theoretical Computer Science, 181(1):159–179, July.

Nitin Madnani, Necip Fazil Ayan, Philip Resnik, and Bonnie J. Dorr. 2007. Using paraphrases for parameter tuning in statistical machine translation. In Proceedings of the Workshop on Statistical Machine Translation, Prague, Czech Republic, June. Association for Computational Linguistics.

Nitin Madnani, Philip Resnik, Bonnie J. Dorr, and Richard Schwartz. 2008. Are Multiple Reference Translations Necessary? Investigating the Value of Paraphrased Reference Translations in Parameter Optimization. In Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas, October.

S. Niessen, F.J. Och, G. Leusch, and H. Ney. 2000. An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2000), pages 39–45.

Joseph Olive. 2005. Global Autonomous Language Exploitation (GALE). DARPA/IPTO Proposer Information Pamphlet.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Martin F. Porter. 1980. An algorithm for suffix stripping. Program, 14(3):130–137.

Mark Przybocki, Kay Peterson, and Sebastian Bronsart. 2008. Official results of the NIST 2008 "Metrics for MAchine TRanslation" Challenge (MetricsMATR08). http://nist.gov/speech/tests/metricsmatr/2008/results/, October.

Antti-Veikko Rosti, Spyros Matsoukas, and Richard Schwartz. 2007. Improved word-level system combination for machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 312–319, Prague, Czech Republic, June. Association for Computational Linguistics.