[Figure 1: per-question bar charts comparing the Eng-func, Eng-LF, Hybrid and Chinese systems on WHO, WHAT, WHEN, WHERE and WHY.]

Figure 1. System performance on each 5W. "Partial" indicates that part of the answer was missing. Dashed lines show the performance of the best monolingual system (% Correct on human translations). For the last 3W's, the percentages of answers that were Incorrect "by default" were 30%, 24%, 27% and 22%, respectively, and 8% for the best monolingual system.
5.4 5W Corpus

The full evaluation corpus is 350 documents, roughly evenly divided between four genres: formal text (newswire), informal text (blogs and newsgroups), formal speech (broadcast news) and informal speech (broadcast conversation). For this analysis, we randomly sampled documents to judge from each of the genres. There were 50 documents (249 sentences) that were judged by a single annotator. A subset of that set, with 22 documents and 103 sentences, was judged by two annotators. In comparing the results from one annotator to the results from both annotators, we found substantial agreement. Therefore, we present results from the single annotator so we can do a more in-depth analysis. Since each sentence had 5W's, and there were 6 systems that were compared, there were 7,500 single-annotator judgments over 249 sentences.
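The agreement statistic used is not specified here. As one concrete way to quantify agreement on the doubly judged subset, the following sketch computes Cohen's kappa over the Correct/Partial/Incorrect judgment labels; the choice of statistic and the short label lists are illustrative assumptions, not the evaluation's actual data.

```python
# A minimal sketch, assuming Cohen's kappa is used to measure agreement
# between the two annotators. The label lists are made-up stand-ins for
# the 103-sentence doubly judged subset.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators independently pick the
    # same label, given their marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["Correct", "Correct", "Partial", "Incorrect", "Correct"]
ann2 = ["Correct", "Partial", "Partial", "Incorrect", "Correct"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.688, "substantial" agreement
```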
6 Results

Figure 1 shows the cross-lingual performance (on MT) of all the systems for each 5W. The best monolingual performance (on human translations) is shown as a dashed line (% Correct only). If a system returned Incorrect answers for Who and What, then the other answers were marked Incorrect (as explained in section 5.2). For the last 3W's, the majority of errors were due to this (details in Figure 1), so our error analysis focuses on the Who and What questions.
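This "Incorrect by default" rule can be stated compactly in code. The sketch below is a minimal illustration of the cascade just described, reading "Incorrect answers for Who and What" as both being Incorrect; the labels and dict layout are illustrative assumptions, not the evaluation's actual data format.

```python
# Sketch of the "incorrect by default" scoring rule: if a system's Who and
# What answers are both judged Incorrect, the When/Where/Why answers are
# marked Incorrect as well. Data layout is illustrative.

CORRECT, PARTIAL, INCORRECT = "Correct", "Partial", "Incorrect"

def apply_default_incorrect(judgments):
    """judgments: dict mapping each 5W ('who', 'what', 'when', 'where',
    'why') to a label. Returns a new dict with the cascade applied."""
    result = dict(judgments)
    if result["who"] == INCORRECT and result["what"] == INCORRECT:
        for w in ("when", "where", "why"):
            result[w] = INCORRECT  # Incorrect "by default"
    return result

# Example: wrong Who/What sinks the last 3W's even if their spans matched.
print(apply_default_incorrect(
    {"who": INCORRECT, "what": INCORRECT,
     "when": CORRECT, "where": CORRECT, "why": PARTIAL}))
```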
6.1 Monolingual 5W Performance

To establish a monolingual baseline, the English 5W system was run on reference (human) translations of the Chinese text. For each Partial or Incorrect answer, annotators could select one or more of these reasons:

• Wrong predicate or multiple predicates.
• Answer contained another 5W answer.
• Passive handled wrong (WHO/WHAT).
• Answer missed.
• Argument attached to wrong predicate.

Figure 1 shows the performance of the best monolingual system for each 5W as a dashed line. The What question was the hardest, since it requires two pieces of information (the predicate and object). The When, Where and Why questions were easier, since they were null most of the time. (In English OntoNotes 2.0, 38% of sentences have a When, 15% of sentences have a Where, and only 2.6% of sentences have a Why.) The most common monolingual system error on these three questions was a missed answer, accounting for all of the Where errors, all but one Why error and 71% of the When errors. The remaining When errors usually occurred when the system assumed the wrong sense for adverbs (such as "then" or "just").

            Missing  Other 5W  Wrong/Multiple  Wrong
                               Predicates
  REF-func    37       29          22            7
  REF-LF      54       20          17           13
  MT-func     18       18          18            8
  MT-LF       26       19          10           11
  Chinese     23       17          14            8
  Hybrid      13       17          15           12

Table 3. Percentages of Who/What errors attributed to each system error type.

The top half of Table 3 shows the reasons attributed to the Who/What errors for the reference corpus. Since English-LF preferred shorter answers, it frequently missed answers or parts of answers. English-LF also had more Partial answers on the What question: 66% Correct and 12% Partial, versus 75% Correct and 1% Partial for English-function. On the other hand, English-function was more likely to return answers that contained incorrect extra information, such as another 5W or a second predicate.
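Because annotators could select more than one reason per erroneous answer, a system's row in Table 3 need not sum to 100 (and for the MT-based systems, part of the error mass is attributed to translation instead; see section 6.4). The sketch below illustrates how such a row could be tallied from per-answer reason sets; the records and reason names are invented for illustration.

```python
# Illustrative tally of one Table 3 row: the percent of a system's
# erroneous Who/What answers tagged with each reason. An answer may carry
# several reasons (or none, if its error was attributed to translation
# only), so the percentages need not sum to 100. Records are invented.
from collections import Counter

REASONS = ["missing", "other_5w", "wrong_multi_pred", "wrong"]

def reason_percentages(error_records):
    """error_records: one set of reason tags per erroneous Who/What answer."""
    counts = Counter(tag for tags in error_records for tag in tags)
    n = len(error_records)
    return {r: round(100 * counts[r] / n) for r in REASONS}

errors = [{"missing"}, {"other_5w", "wrong_multi_pred"}, set(), {"missing"}]
print(reason_percentages(errors))
# {'missing': 50, 'other_5w': 25, 'wrong_multi_pred': 25, 'wrong': 0}
```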
6.2 Effect of MT on 5W Performance

The cross-lingual 5W task requires that systems return intelligible responses that are semantically equivalent to the source sentence (or, in the case of this evaluation, equivalent to the reference). As can be seen in Figure 1, MT degrades the performance of the 5W systems significantly, for all question types and for all systems. Averaged over all questions, the best monolingual system does 19% better than the best cross-lingual system. Surprisingly, even though English-function outperformed English-LF on the reference data, English-LF does consistently better on MT. This is likely due to its use of multiple back-off methods when the parser failed.
6.3 Source-Language vs. Target-Language

The Chinese system did slightly worse than either English system overall, but in the formal text genre, it outperformed both English systems. Although the accuracies for the Chinese and English systems are similar, the answers vary a lot. Nearly half (48%) of the answers were answered correctly by both the English system and the Chinese system. But 22% of the time, the English system returned the correct answer when the Chinese system did not. Conversely, 10% of the answers were returned correctly by the Chinese system and not the English systems. The hybrid system described in section 4.2 attempts to exploit these complementary advantages.

After running the hybrid system, 61% of the answers were from English-LF, 25% from English-function, 7% from Chinese-align, and the remaining 7% were from the other Chinese methods (not evaluated here). The hybrid did better than its parent systems on all 5Ws, and the numbers above indicate that further improvement is possible with a better combination strategy.
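The combination strategy itself is defined in section 4.2 and is not reproduced here. Purely as an illustration consistent with the answer-source breakdown above, the following sketch shows a simple fixed-precedence back-off; the precedence order and system names are assumptions, not the actual method.

```python
# Hypothetical sketch of a precedence-based answer combination, NOT the
# actual strategy of section 4.2. Each candidate system is tried in a
# fixed order and the first non-empty answer wins, which would naturally
# yield an output mix dominated by the first system in the list.

PRECEDENCE = ["english_lf", "english_func", "chinese_align"]

def combine_answers(candidates):
    """candidates: dict mapping system name -> its answer for one 5W
    (or None). Returns (system_used, answer), or (None, None) if all
    systems abstain."""
    for system in PRECEDENCE:
        answer = candidates.get(system)
        if answer:  # skip missing/empty answers
            return system, answer
    return None, None

print(combine_answers({"english_lf": None,
                       "english_func": "the robbers",
                       "chinese_align": "robbers"}))
# -> ('english_func', 'the robbers')
```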
6.4 Cross-Lingual 5W Error Analysis

For each Partial or Incorrect answer, annotators were asked to select system errors, translation errors, or both. (Further analysis is necessary to distinguish between ASR errors and MT errors.) The translation errors considered were:

• Word/phrase deleted.
• Word/phrase mistranslated.
• Word order mixed up.
• MT unreadable.

Table 4 shows the translation reasons attributed to the Who/What errors. For all systems, the errors were almost evenly divided between system-only, MT-only and both, although the Chinese system had a higher percentage of system-only errors. The hybrid system was able to overcome many system errors (for example, in Table 3, only 13% of its errors are due to missing answers), but still suffered from MT errors.

            Mistranslation  Deletion  Word Order  Unreadable
  MT-func        34            18         24          18
  MT-LF          29            22         21          14
  Chinese        32            17          9          13
  Hybrid         35            19         27          18

Table 4. Percentages of Who/What errors by each system attributed to each translation error type.

Mistranslation was the biggest translation problem for all the systems. Consider the first example in Figure 3.

MT: After several rounds of reminded, I was a little bit
Ref: After several hints, it began to come back to me.
Source: 经过几番提醒,我回忆起来了一点点。

MT: The Guizhou province, within a certain bank robber, under the watchful eyes of a weak woman, and, with a knife stabbed the woman.
Ref: I saw that in a bank in Guizhou Province, robbers seized a vulnerable young woman in front of a group of onlookers and stabbed the woman with a knife.
Source: 看到贵州省某银行内,劫匪在众目睽睽之下,抢夺一个弱女子,并且,用刀刺伤该女子。

MT: Woke up after it was discovered that the property is not more than eleven people do not even said that the memory of the receipt of the country into the country.
Ref: Well, after waking up, he found everything was completely changed. Apart from having additional eleven grandchildren, even the motherland as he recalled has changed from a socialist country to a capitalist country.
Source: 那么醒来之后却发现物是人非,多了十一个孙子不说,连祖国也从记忆当中的社会主义国家变成了资本主义国家。

Figure 3. Example sentences that presented problems for the 5W systems.
Both English systems correctly extracted the Who and the When, but for What they returned "was a little bit." This is the correct predicate for the sentence, but it does not match the meaning of the reference. The Chinese 5W system was able to select a better translation, and instead returned "remember a little bit."

Garbled word order was selected as a reason for 21-24% of the target-language system Who/What errors, but only 9% of the source-language system Who/What errors. The source-language word order problems tended to be local, within-phrase errors (e.g., "the dispute over frozen funds" was translated as "the freezing of disputes"). The target-language system word order problems were often long-distance problems. For example, the second sentence in Figure 3 has many phrases in common with the reference translation, but the overall sentence makes no sense. The watchful eyes actually belong to a "group of onlookers" (deleted). Ideally, the robber would have "stabbed the woman" "with a knife," rather than vice versa. Long-distance phrase movement is a common problem in Chinese-English MT, and many MT systems try to handle it (e.g., Wang et al. 2007). By doing analysis in the source language, the Chinese 5W system is often able to avoid this problem – for example, it successfully returned "robbers" "grabbed a weak woman" for the Who/What of this sentence.

Although we expected that the Chinese system would have fewer problems with MT deletion, since it could choose from three different MT versions, MT deletion was a problem for all systems. In looking more closely at the deletions, we noticed that over half of the deletions were verbs that were completely missing from the translated sentence. Since MT systems are tuned for word-based overlap measures (such as BLEU), verb deletion is penalized equally with, for example, determiner deletion. Intuitively, a verb deletion destroys the central meaning of a sentence, while a determiner is rarely necessary for comprehension. Other kinds of deletions included noun phrases, pronouns, named entities, negations and longer connecting phrases.
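To make the overlap-measure point concrete, the sketch below scores two hypotheses against a reference using unigram precision with a BLEU-style brevity penalty as a simplified stand-in for BLEU: deleting the main verb and deleting a determiner receive identical scores, even though only the former destroys the meaning. The sentences and the simplified scorer are illustrative, not taken from the evaluation data.

```python
# Sketch: a word-overlap metric is blind to WHICH word was deleted.
# overlap_score is a simplified stand-in for BLEU (unigram precision
# plus BLEU-style brevity penalty); the sentences are invented.
import math
from collections import Counter

def overlap_score(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    ref_counts = Counter(ref)
    # Clipped (modified) unigram matches, as in BLEU's 1-gram precision.
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(hyp).items())
    precision = matched / len(hyp)
    brevity = math.exp(1 - len(ref) / len(hyp)) if len(hyp) < len(ref) else 1.0
    return precision * brevity

ref = "the robber stabbed the woman with a knife"
no_verb = "the robber the woman with a knife"       # main verb deleted
no_det = "the robber stabbed the woman with knife"  # one determiner deleted

print(round(overlap_score(no_verb, ref), 3))  # 0.867
print(round(overlap_score(no_det, ref), 3))   # 0.867 - identical penalty
```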
Deletion also affected When and Where. Deleting particles such as "in" and "when" that indicate a location or temporal argument caused the English systems to miss the argument. Word order problems in MT also caused attachment ambiguity in When and Where.

The "unreadable" category was an option of last resort for very difficult MT sentences. The third sentence in Figure 3 is an example where ASR and MT errors compounded to create an unparseable sentence.

7 Conclusions

In our evaluation of various 5W systems, we discovered several characteristics of the task. The What answer was the hardest for all systems, since it is difficult to include enough information to cover the top-level predicate and object without getting penalized for including too much. The challenge in the When, Where and Why questions is due to sparsity – these responses occur in far fewer sentences than Who and What, so systems most often missed these answers. Since this was a new task, this first evaluation showed clear issues on the language analysis side that can be improved in the future.

The best cross-lingual 5W system was still 19% worse than the best monolingual 5W system, which shows that MT significantly degrades sentence-level understanding. A serious problem in MT for systems was deletion. Chinese constituents that were never translated caused serious problems, even when individual systems had strategies to recover. When the verb was deleted, no top-level predicate could be found and then all 5Ws were wrong.

One of our main research questions was whether to extract or translate first. We hypothesized that doing source-language analysis would be more accurate, given the noise in Chinese MT, but the systems performed about the same. This is probably because the English tools (logical form extraction and parser) were more mature and accurate than the Chinese tools.

Although neither source-language nor target-language analysis was able to circumvent problems in MT, each approach had advantages relative to the other, since they did well on different sets of sentences. For example, Chinese-align had fewer problems with word order, and most of those were due to local word-order problems. Since the source-language and target-language systems made different kinds of mistakes, we were able to build a hybrid system that used the relative advantages of each system to outperform all systems. The different types of mistakes made by each system suggest features that can be used to improve the combination system in the future.

Acknowledgments

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.
References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In COLING-ACL '98: Proceedings of the Conference, held at the University of Montréal, pages 86–90.

Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 152–164.

Eugene Charniak. 2001. Immediate-head parsing for language models. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France.

John Chen and Owen Rambow. 2003. Use of deep linguistic features for the recognition and labeling of semantic arguments. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Sapporo, Japan.

Katrin Erk and Sebastian Pado. 2006. Shalmaneser – a toolchain for shallow semantic parsing. In Proceedings of LREC.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Daniel Gildea and Martha Palmer. 2002. The necessity of parsing for predicate argument recognition. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, USA.

Aria Haghighi, Kristina Toutanova, and Christopher Manning. 2005. A joint model for semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 173–176.

Mary Harper and Zhongqiang Huang. 2009. Chinese Statistical Parsing, chapter to appear.

Paul Kingsbury and Martha Palmer. 2003. PropBank: the next level of treebank. In Proceedings of Treebanks and Lexical Theories.

Evgeny Matusov, Gregor Leusch, and Hermann Ney. 2009. Learning to combine machine translation systems. In Cyril Goutte, Nicola Cancedda, Marc Dymetman, and George Foster, editors, Learning Machine Translation, pages 257–276. The MIT Press, Cambridge, Mass.

Adam Meyers, Ralph Grishman, Michiko Kosaka, and Shubin Zhao. 2001. Covering Treebanks with GLARF. In Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources, pages 51–58. Association for Computational Linguistics, Morristown, NJ.

Teruko Mitamura, Eric Nyberg, Hideki Shima, Tsuneaki Kato, Tatsunori Mori, Chin-Yew Lin, Ruihua Song, Chuan-Jie Lin, Tetsuya Sakai, Donghong Ji, and Noriko Kando. 2008. Overview of the NTCIR-7 ACLIA Tasks: Advanced Cross-Lingual Information Access. In Proceedings of the Seventh NTCIR Workshop Meeting.

Alessandro Moschitti, Silvia Quarteroni, Roberto Basili, and Suresh Manandhar. 2007. Exploiting syntactic and shallow semantic kernels for question answer classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 776–783.

Kristen Parton, Kathleen R. McKeown, James Allan, and Enrique Henestroza. 2008. Simultaneous multilingual search for translingual information retrieval. In Proceedings of the ACM 17th Conference on Information and Knowledge Management (CIKM), Napa Valley, CA.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proceedings of HLT-NAACL 2007.

Kiyoshi Sudo, Satoshi Sekine, and Ralph Grishman. 2004. Cross-lingual information extraction system evaluation. In Proceedings of the 20th International Conference on Computational Linguistics.

Honglin Sun and Daniel Jurafsky. 2004. Shallow semantic parsing of Chinese. In Proceedings of NAACL-HLT.

Cynthia A. Thompson, Roger Levy, and Christopher Manning. 2003. A generative model for semantic role labeling. In Proceedings of the 14th European Conference on Machine Learning.

Chao Wang, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 737–745.

Jianqiang Wang and Douglas W. Oard. 2006. Combining bidirectional translation and synonymy for cross-language information retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 202–209.

Nianwen Xue and Martha Palmer. 2004. Calibrating features for semantic role labeling. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 88–94, Barcelona, Spain. Association for Computational Linguistics.

Nianwen Xue and Martha Palmer. 2005. Automatic semantic role labeling for Chinese verbs. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, pages 1160–1165.