
From: AAAI-96 Proceedings. Copyright © 1996, AAAI. All rights reserved.

Automatically Generating Extraction Patterns from Untagged Text

Ellen Riloff
Department of Computer Science
University of Utah
Salt Lake City, UT 84112

Abstract

Many corpus-based natural language processing systems rely on text corpora that have been manually annotated with syntactic or semantic tags. In particular, all previous dictionary construction systems for information extraction have used an annotated training corpus or some form of annotated input. We have developed a system called AutoSlog-TS that creates dictionaries of extraction patterns using only untagged text. AutoSlog-TS is based on the AutoSlog system, which generated extraction patterns using annotated text and a set of heuristic rules. By adapting AutoSlog and combining it with statistical techniques, we eliminated its dependency on tagged text. In experiments with the MUC-4 terrorism domain, AutoSlog-TS created a dictionary of extraction patterns that performed comparably to a dictionary created by AutoSlog, using only preclassified texts as input.

Motivation

The vast amount of text becoming available on-line offers new possibilities for conquering the knowledge-engineering bottleneck lurking underneath most natural language processing (NLP) systems. Most corpus-based systems rely on a text corpus that has been manually tagged in some way. For example, the Brown corpus (Francis & Kucera 1982) and the Penn Treebank corpus (Marcus, Santorini, & Marcinkiewicz 1993) are widely used because they have been manually annotated with part-of-speech and syntactic bracketing information. Part-of-speech tagging and syntactic bracketing are relatively general in nature, so these corpora can be used by different natural language processing systems and for different domains. But some corpus-based systems rely on a text corpus that has been manually tagged in a domain-specific or task-specific manner. For example, corpus-based approaches to information extraction generally rely on special domain-specific text annotations. Consequently, the manual tagging effort is considerably less cost effective because the annotated corpus is useful for only one type of NLP system and for only one domain.

Corpus-based approaches to information extraction have demonstrated a significant time savings over conventional hand-coding methods (Riloff 1993). But the time required to annotate a training corpus is a non-trivial expense. To further reduce this knowledge-engineering bottleneck, we have developed a system called AutoSlog-TS that generates extraction patterns using untagged text. AutoSlog-TS needs only a preclassified corpus of relevant and irrelevant texts. Nothing inside the texts needs to be tagged in any way.

Generating Extraction Patterns from Tagged Text

Related work

In the last few years, several systems have been developed to generate patterns for information extraction automatically. All of the previous systems depend on manually tagged training data of some sort. One of the first dictionary construction systems was AutoSlog (Riloff 1993), which requires tagged noun phrases in the form of annotated text or text with associated answer keys. PALKA (Kim & Moldovan 1993) is similar in spirit to AutoSlog, but requires manually defined frames (including keywords), a semantic hierarchy, and an associated lexicon. Competing hypotheses are resolved by referring to manually encoded answer keys, if available, or by asking a user.

CRYSTAL (Soderland et al. 1995) also generates extraction patterns using an annotated training corpus. CRYSTAL relies on both domain-specific annotations plus a semantic hierarchy and associated lexicon. LIEP (Huffman 1996) is another system that learns extraction patterns but relies on predefined keywords, object recognizers (e.g., to identify people and companies), and human interaction to annotate each relevant sentence with an event type. Cardie (Cardie 1993) and Hastings (Hastings & Lytinen 1994) also developed lexical acquisition systems for information extraction, but their systems learned individual word meanings rather than extraction patterns. Both systems used a semantic hierarchy and sentence contexts to learn the meanings of unknown words.


AutoSlog

AutoSlog (Riloff 1996) is a dictionary construction system that creates extraction patterns automatically using heuristic rules. As input, AutoSlog needs answer keys or text in which the noun phrases that should be extracted have been labeled with domain-specific tags. For example, in a terrorism domain, noun phrases that refer to perpetrators, targets, and victims may be tagged.

Given a tagged noun phrase and the original source text, AutoSlog first identifies the sentence in which the noun phrase appears. If there is more than one such sentence and the annotation does not indicate which one is appropriate, then AutoSlog chooses the first one. AutoSlog invokes a sentence analyzer called CIRCUS (Lehnert 1991) to identify clause boundaries and syntactic constituents. AutoSlog needs only a flat syntactic analysis that recognizes the subject, verb, direct object, and prepositional phrases of each clause, so almost any parser could be used. AutoSlog determines which clause contains the targeted noun phrase and applies the heuristic rules shown in Figure 1.

The rules are divided into three categories, based on the syntactic class of the noun phrase. For example, if the targeted noun phrase is the subject of a clause, then the subject rules apply. Each rule generates an expression that likely defines the conceptual role of the noun phrase. In most cases, they assume that the verb determines the role. The rules recognize several verb forms, such as active, passive, and infinitive. An extraction pattern is created by instantiating the rule with the specific words that it matched in the sentence. The rules are ordered so that the first one satisfied generates the extraction pattern, with the longer patterns being tested before the shorter ones.

PATTERN                    EXAMPLE
<subj> passive-verb        <victim> was murdered
<subj> active-verb         <perp> bombed
<subj> verb infin.         <perp> attempted to kill
<subj> aux noun            <victim> was victim
passive-verb <dobj>        killed <victim>
active-verb <dobj>         bombed <target>
infin. <dobj>              tried to attack <target>
gerund <dobj>              killing <victim>
noun aux <dobj>            fatality was <victim>
noun prep <np>             bomb against <target>
active-verb prep <np>      killed with <instrument>
passive-verb prep <np>     was aimed at <target>

Figure 1: AutoSlog Heuristics

[Footnote 1: In principle, passive verbs should not have direct objects. We included the passive-verb <dobj> pattern only because CIRCUS occasionally confused active and passive constructions.]

As an example, consider the following sentence:

    Ricardo Castellar, the mayor, was kidnapped yesterday by the FMLN.

Suppose that "Ricardo Castellar" was tagged as a relevant victim. AutoSlog passes the sentence to CIRCUS, which identifies Ricardo Castellar as the subject. AutoSlog's subject heuristics are tested and the <subj> passive-verb rule fires. This pattern is instantiated with the specific words in the sentence to produce the extraction pattern <victim> was kidnapped. In future texts, this pattern will be activated whenever the verb "kidnapped" appears in a passive construction, and its subject will be extracted as a victim.
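The instantiation step just described can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original implementation: the flat clause representation and the three rules modeled here are simplifying assumptions (AutoSlog itself uses thirteen heuristics and CIRCUS for parsing).

```python
# Illustrative sketch of AutoSlog-style pattern instantiation. The clause
# representation and the rule subset are assumptions for illustration,
# not the original CIRCUS-based implementation.

def instantiate_pattern(clause, target_role, target_np):
    """Return an extraction pattern for a tagged noun phrase, or None.

    clause: flat analysis with "subject", "verb", "voice", and "dobj"
    target_role: conceptual role of the tagged NP, e.g. "victim"
    target_np: the tagged noun phrase itself
    """
    slot = "<%s>" % target_role
    # Subject heuristics: the first rule that is satisfied fires.
    if clause.get("subject") == target_np:
        if clause.get("voice") == "passive":
            return "%s was %s" % (slot, clause["verb"])  # <subj> passive-verb
        return "%s %s" % (slot, clause["verb"])          # <subj> active-verb
    # Direct object heuristic.
    if clause.get("dobj") == target_np:
        return "%s %s" % (clause["verb"], slot)          # active-verb <dobj>
    return None

clause = {"subject": "Ricardo Castellar", "verb": "kidnapped",
          "voice": "passive", "dobj": None}
print(instantiate_pattern(clause, "victim", "Ricardo Castellar"))
# -> <victim> was kidnapped
```

Given the tagged subject of a passive clause, the sketch produces the same pattern as the Ricardo Castellar example above.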
AutoSlog can produce undesirable patterns for a variety of reasons, including faulty sentence analysis, incorrect pp-attachment, or insufficient context. Therefore a person must manually inspect each extraction pattern and decide which ones should be accepted and which ones should be rejected. This manual filtering process is typically very fast. In experiments with the MUC-4 terrorism domain, it took a user only 5 hours to review 1237 extraction patterns (Riloff 1993). Although the manual filtering process is part of the knowledge-engineering cycle, generating the annotated training corpus is a much more substantial bottleneck, both in time and difficulty.

Generating Extraction Patterns from Untagged Text

To tag or not to tag?

Generating an annotated training corpus is a significant undertaking. Previous experiments with AutoSlog suggested that it took a user about 8 hours to annotate 160 texts (Riloff 1996). Therefore it would take roughly a week to construct a training corpus of 1000 texts. Committing a domain expert to a knowledge-engineering project for a week is prohibitive for most short-term applications.

Furthermore, the annotation task is deceptively complex. For AutoSlog, the user must annotate relevant noun phrases. But what constitutes a relevant noun phrase? Should the user include modifiers or just the head noun? All modifiers or just the relevant modifiers? Determiners? If the noun phrase is part of a conjunction, should the user annotate all conjuncts or just one? Should the user include appositives? How about prepositional phrases? The meaning of simple NPs can change substantially when a prepositional phrase is attached.

For example, "the Bank of Boston" is different from "the Bank of Toronto." Real texts are loaded with complex noun phrases that often include a variety of these constructs in a single reference. There is also the question of which references to tag. Should the user tag all references to a person? If not, which ones? It is difficult to specify a convention that reliably captures the desired information, and specifying a convention can produce inconsistencies in the data.

To avoid these problems, we have developed a new version of AutoSlog, called AutoSlog-TS, that does not require any text annotations. AutoSlog-TS requires only a preclassified training corpus of relevant and irrelevant texts for the domain. [Footnote 2: Ideally, the irrelevant texts should be "near-miss" texts that are similar to the relevant texts.] A preclassified corpus is much easier to generate, since the user simply needs to identify relevant and irrelevant sample texts. Furthermore, relevant texts are already available on-line for many applications and could be easily exploited to create a training corpus for AutoSlog-TS.

AutoSlog-TS

AutoSlog-TS is an extension of AutoSlog that operates exhaustively by generating an extraction pattern for every noun phrase in the training corpus. It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern. The process is illustrated in Figure 2.

[Figure 2: AutoSlog-TS flowchart (graphic not reproduced)]

In Stage 1, the sentence analyzer produces a syntactic analysis for each sentence and identifies the noun phrases. For each noun phrase, the heuristic rules generate a pattern (called a concept node in CIRCUS) to extract the noun phrase. AutoSlog-TS uses 15 heuristic rules: the original 13 rules used by AutoSlog plus two more, <subj> active-verb dobj and infinitive prep <np>. The two additional rules were created for a business domain from a previous experiment and are probably not very important for the experiments described in this paper. [Footnote 3: See (Riloff 1996) for a more detailed explanation.] A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context. As a result, multiple extraction patterns may be generated in response to a single noun phrase. For example, the sentence "terrorists bombed the U.S. embassy" might produce two patterns to extract the terrorists: <subj> bombed and <subj> bombed embassy. The statistics will later reveal whether the shorter, more general pattern is good enough, or whether the longer pattern is needed to be reliable for the domain. At the end of Stage 1, we have a giant dictionary of extraction patterns that are literally capable of extracting every noun phrase in the corpus.

In Stage 2, we process the training corpus a second time using the new extraction patterns. The sentence analyzer activates all patterns that are applicable in each sentence. We then compute relevance statistics for each pattern. More specifically, we estimate the conditional probability that a text is relevant given that it activates a particular extraction pattern. The formula is:

    Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i

where rel-freq_i is the number of instances of pattern_i that were activated in relevant texts, and total-freq_i is the total number of instances of pattern_i that were activated in the training corpus. For the sake of simplicity, we will refer to this probability as a pattern's relevance rate.

Note that many patterns will be activated in relevant texts even though they are not domain-specific. For example, general phrases such as "was reported" will appear in all sorts of texts. The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than in irrelevant texts.

Finally, we rank the patterns in order of importance to the domain. AutoSlog-TS's exhaustive approach to pattern generation can easily produce tens of thousands of extraction patterns, and we cannot reasonably expect a human to review them all. Therefore, we use a ranking function to order them so that a person only needs to review the most highly ranked patterns. We rank the extraction patterns according to the formula relevance rate * log2(frequency), unless the relevance rate is ≤ 0.5, in which case the function returns zero because the pattern is negatively correlated with the domain (assuming the corpus is 50% relevant).
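The relevance rate and the ranking function can be captured in a short sketch. One assumption made here, which the text does not state explicitly: the "frequency" term is taken to be the pattern's total frequency in the training corpus.

```python
from math import log2

def rank_score(rel_freq, total_freq):
    """Rank an extraction pattern as described in the text:
    relevance_rate * log2(frequency), or zero when the relevance rate
    is <= 0.5, i.e., the pattern is negatively correlated with the
    domain (assuming a corpus that is roughly 50% relevant).

    Assumption: "frequency" is read as the pattern's total frequency.
    """
    relevance_rate = rel_freq / total_freq
    if relevance_rate <= 0.5:
        return 0.0
    return relevance_rate * log2(total_freq)

# Hypothetical counts: a domain-specific pattern vs. a general phrase.
print(rank_score(90, 100))   # e.g. "<subj> was kidnapped": high score
print(rank_score(50, 100))   # e.g. "was reported" -> 0.0
```

The log term keeps moderately relevant but very frequent patterns competitive with rare, perfectly relevant ones, which is exactly the bias the ranking discussion describes.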

This formula promotes patterns that have a high relevance rate or a high frequency. It is important for high frequency patterns to be considered even if their relevance rate is only moderate (say 70%), because of expressions like "was killed" which occur frequently in both relevant and irrelevant texts. If only the patterns with the highest relevance rates were promoted, then crucial expressions like this would be buried in the ranked list. We do not claim that this particular ranking function is the best; indeed, we will argue later that a better function is needed. But this function worked reasonably well in our experiments.

Experimental Results

We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging "correct" output (MUC-4 Proceedings 1992). Automated scoring programs were developed to evaluate information extraction (IE) systems for the message understanding conferences, but the credit assignment problem for any individual component is virtually impossible using only the scores produced by these programs. Therefore, we evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain.

The AutoSlog dictionary was constructed using the 772 relevant MUC-4 texts and their associated answer keys. AutoSlog produced 1237 extraction patterns, which were manually filtered in about 5 person-hours. The final AutoSlog dictionary contained 450 extraction patterns.

The AutoSlog-TS dictionary was constructed using the 1500 MUC-4 development texts, of which about 50% are relevant. AutoSlog-TS generated 32,345 unique extraction patterns. To make the size of the dictionary more manageable, we discarded patterns that were proposed only once, under the assumption that they were not likely to be of much value. This reduced the size of the dictionary down to 11,225 extraction patterns. We loaded the dictionary into CIRCUS, reprocessed the corpus, and computed the relevance rate of each pattern. Finally, we ranked all 11,225 patterns using the ranking function.

The ranked extraction patterns were then presented to a user for manual review. [Footnote 4: The author did the manual review for this experiment.] The review process consists of deciding whether a pattern should be accepted or rejected, and labeling the accepted patterns. [Footnote 5: Note that AutoSlog's patterns were labeled automatically by referring to the text annotations.] For example, the pattern murder of <np> was accepted and labeled as a murder pattern that will extract victims. The user reviewed the top 1970 patterns in about 85 minutes and then stopped, because few patterns were being accepted at that point. The review time was much faster than for AutoSlog, largely because the ranking scheme clustered the best patterns near the top, so the retention rate dropped quickly. In total, 210 extraction patterns were retained for the final dictionary.

The 25 top-ranked extraction patterns appear in Figure 3. Most of these patterns are clearly associated with terrorism, so the ranking function appears to be doing a good job of pulling the domain-specific patterns up to the top. Note that some of the patterns in Figure 3 were not accepted for the dictionary even though they are associated with terrorism. Only patterns useful for extracting perpetrators, victims, targets, and weapons were kept. For example, the pattern exploded in <np> was rejected because it would extract locations.

 1. <subj> exploded          14. <subj> occurred
 2. murder of <np>           15. <subj> was located
 3. assassination of <np>    16. took_place on <np>
 4. <subj> was killed        17. responsibility for <np>
 5. <subj> was kidnapped     18. occurred on <np>
 6. attack on <np>           19. was wounded in <np>
 7. <subj> was injured       20. destroyed <dobj>
 8. exploded in <np>         21. <subj> was murdered
 9. death of <np>            22. one of <np>
10. <subj> took_place        23. <subj> kidnapped
11. caused <dobj>            24. exploded on <np>
12. claimed <dobj>           25. <subj> died
13. <subj> was wounded

Figure 3: The Top 25 Extraction Patterns

To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set: 25 relevant texts and 25 irrelevant texts from the TST3 test set, plus 25 relevant texts and 25 irrelevant texts from the TST4 test set. We ran CIRCUS on these 100 texts, first using the AutoSlog dictionary and then using the AutoSlog-TS dictionary. The underlying information extraction system was otherwise identical.

We scored the output by assigning each extracted item to one of four categories: correct, mislabeled, duplicate, or spurious. An item was scored as correct if it matched against the answer keys. An item was mislabeled if it matched against the answer keys but was extracted as the wrong type of object; for example, "Hector Colindres" was listed as a murder victim but was extracted as a physical target. An item was a duplicate if it was coreferent with an item in the answer keys, for example, if "him" was extracted and was coreferent with "Hector Colindres." The extraction pattern acted correctly in this case, but the extracted information was not specific enough. Correct items extracted more than once were also scored as duplicates. An item was spurious if it did not refer to any object in the answer keys.
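The four-way scoring scheme can be sketched as a small function. The answer-key and coreference representations below are hypothetical simplifications for illustration; the actual judgments in the experiment were made manually.

```python
# Hypothetical sketch of the four-way scoring scheme used in the manual
# evaluation. The answer-key representation (slot -> set of strings) and
# the coreference table are illustrative assumptions.

def score_item(extracted, slot, answer_keys, coref):
    """Classify one extracted item as 'correct', 'mislabeled',
    'duplicate', or 'spurious'.

    answer_keys: dict mapping slot name to the set of answer strings
    coref: dict mapping a string to the answer-key string it corefers with
    """
    if extracted in answer_keys.get(slot, set()):
        return "correct"
    if any(extracted in keys
           for s, keys in answer_keys.items() if s != slot):
        return "mislabeled"   # in the answer keys, but wrong type of object
    if coref.get(extracted) in answer_keys.get(slot, set()):
        return "duplicate"    # coreferent with an answer, not specific enough
    return "spurious"

keys = {"victim": {"Hector Colindres"}, "target": set()}
coref = {"him": "Hector Colindres"}
print(score_item("him", "victim", keys, coref))               # -> duplicate
print(score_item("Hector Colindres", "target", keys, coref))  # -> mislabeled
```

Note that this sketch omits one detail of the scheme: a correct item extracted more than once is also scored as a duplicate, which would require tracking extraction counts.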

All items extracted from irrelevant texts were spurious. Items in the answer keys that were not extracted were counted as missing; therefore correct + missing should equal the total number of items in the answer keys. [Footnote 6: "Optional" items in the answer keys were scored as correct if extracted, but were never scored as missing.] We scored three types of items: perpetrators, victims, and targets.

Tables 1 and 2 show the numbers obtained after manually judging the output of the dictionaries.

Slot      Corr.   Miss.   Mislab.   Dup.   Spur.
Perp       36      22        1       11     129
Victim     41      24        7       18     113
Target     39      19        8       18     108
Total     116      65       16       47     350

Table 1: AutoSlog Results

[Table 2: AutoSlog-TS Results (the individual counts are not recoverable in this copy)]

We applied three performance metrics to this raw data: recall, precision, and the F-measure. We calculated recall as correct / (correct + missing), and we computed precision as (correct + duplicate) / (correct + duplicate + mislabeled + spurious). The F-measure (MUC-4 Proceedings 1992) combines recall and precision into a single value, in our case with equal weight.
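As a quick sanity check, the metric definitions above can be applied to the Table 1 totals. A minimal sketch, taking the equal-weight F-measure to be the harmonic mean of recall and precision:

```python
def metrics(correct, missing, mislabeled, duplicate, spurious):
    """Recall, precision, and equally weighted F-measure, as defined in
    the evaluation: recall = correct / (correct + missing);
    precision = (correct + duplicate) /
                (correct + duplicate + mislabeled + spurious)."""
    recall = correct / (correct + missing)
    precision = (correct + duplicate) / (
        correct + duplicate + mislabeled + spurious)
    f_measure = (2 * precision * recall) / (precision + recall)
    return recall, precision, f_measure

# Totals from Table 1 (AutoSlog): Corr=116, Miss=65, Mislab=16,
# Dup=47, Spur=350.
recall, precision, f = metrics(116, 65, 16, 47, 350)
print(round(recall, 2), round(precision, 2), round(f, 2))  # -> 0.64 0.31 0.42
```

The low precision here reflects the large spurious count, consistent with the discussion below of why these scores are lower than the earlier MUC-4 results.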
[Table 3: Comparative Results (recall, precision, and F-measure for each slot; the values are not recoverable in this copy)]

The performance of the two dictionaries was very similar. As the raw data suggests, the AutoSlog dictionary extracted slightly more correct items, but the AutoSlog-TS dictionary extracted fewer spurious items. Table 3 shows that AutoSlog achieved slightly higher recall and AutoSlog-TS achieved higher precision. [Footnote 7: The difference in mislabeled items is an artifact of the human review process.] The F-measure scores were similar for both systems, but AutoSlog-TS obtained slightly higher F scores for victims and targets. Note that the AutoSlog-TS dictionary contained only 210 patterns, while the AutoSlog dictionary contained 450 patterns, so AutoSlog-TS achieved a comparable level of recall with a dictionary less than half the size.

We applied a well-known statistical technique, the two-sample t test, to determine whether the differences between the dictionaries were statistically significant. We tested four data sets: correct, correct + duplicate, missing, and spurious. The t values for these sets were 1.1012, 1.1818, 1.1557, and 2.27, respectively. The spurious data was significantly different at the p < 0.05 significance level, but the correct, correct + duplicate, and missing data sets were not significantly different even at the p < 0.20 significance level. Therefore AutoSlog-TS was significantly more effective at reducing spurious extractions, and these results suggest that AutoSlog and AutoSlog-TS can extract relevant information with comparable performance.

The AutoSlog precision results are substantially lower than those generated by the MUC-4 scoring program (Riloff 1993). There are several reasons for the difference. For one, the previously reported scores were based on the UMass/MUC-4 system, which included a discourse analyzer that used domain-specific rules to distinguish terrorist incidents from other events. CIRCUS was designed to extract potentially relevant information using only local context, and we scored all items extracted from irrelevant texts as spurious under the assumption that a complete IE system would contain a discourse analyzer to make global decisions about relevance. Also, the current experiments were done with a debilitated version of CIRCUS that did not process conjunctions or semantic features. Semantic features were not used in the current experiments for technical reasons, but they undoubtedly would have improved the precision of both dictionaries. Although AutoSlog does not use semantic features to create extraction patterns, they can be incorporated as selectional restrictions in the patterns; for example, extracted victims should satisfy a human constraint.

Behind the scenes

It is informative to look behind the scenes and try to understand why AutoSlog achieved slightly better recall and why AutoSlog-TS achieved better precision. Most of AutoSlog's additional recall came from low frequency patterns that were buried deep in AutoSlog-TS's ranked list. Without corpus tagging, we are at the mercy of the ranking function; the main advantage of corpus tagging is that the annotations provide guidance so the system can more easily hone in on the relevant expressions. We believe that the ranking function did a good job of pulling the most important patterns up to the top.

In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog, because it generated many good patterns that AutoSlog did not. AutoSlog-TS produced 158 patterns with a relevance rate ≥ 90% and frequency ≥ 5; only 45 of these patterns were in the original AutoSlog dictionary.

The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics. Some of AutoSlog's patterns looked good to the human reviewer, but were not in fact highly correlated with relevance. For example, the AutoSlog dictionary contains an extraction pattern for the expression <subj> admitted, but this pattern was found to be negatively correlated with relevance (46%) by AutoSlog-TS.

Future Directions

The previous results suggest that a core dictionary of extraction patterns can be created after reviewing only a few hundred patterns. Almost 35% recall was achieved after reviewing only the first 50 extraction patterns! Almost 50% recall was achieved after reviewing about 300 patterns. The specific number of patterns that need to be reviewed will ultimately depend on the breadth of the domain and the desired performance levels.

A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list. In an ideal ranking scheme, the most important patterns (in terms of recall) would be reviewed first; the "heavy hitter" extraction patterns should float to the top so that high recall can be achieved quickly. The current ranking scheme is biased towards encouraging high frequency patterns to float to the top, but additional research is needed to recognize good low frequency patterns, and a better ranking scheme might be able to balance these two needs more effectively. The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, by creating more complex extraction patterns.

AutoSlog-TS represents an important step towards making information extraction systems more easily portable across domains. AutoSlog-TS is the first system to generate domain-specific extraction patterns automatically without annotated training data. A user only needs to provide sample texts (relevant and irrelevant) and spend some time filtering and labeling the resulting extraction patterns. Fast dictionary construction also opens the door for IE technology to support other tasks, such as text classification (Riloff & Shoen 1995). AutoSlog-TS represents a new approach to exploiting on-line text corpora for domain-specific knowledge acquisition by squeezing preclassified texts for all they're worth.

Acknowledgments

This research was funded by NSF grant MIP-9023174 and NSF grant IRI-9509820. Thanks to Kern Mason and Jay Shoen for generating much of the data.

References

Cardie, C. 1993. A Case-Based Approach to Knowledge Acquisition for Domain-Specific Sentence Analysis. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 798-803. AAAI Press/The MIT Press.

Francis, W., and Kucera, H. 1982. Frequency Analysis of English Usage. Boston, MA: Houghton Mifflin.

Hastings, P., and Lytinen, S. 1994. The Ups and Downs of Lexical Acquisition. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 754-759. AAAI Press/The MIT Press.

Huffman, S. 1996. Learning information extraction patterns from examples. In Wermter, S.; Riloff, E.; and Scheler, G., eds., Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing. Berlin: Springer-Verlag.

Kim, J., and Moldovan, D. 1993. Acquisition of Semantic Patterns for Information Extraction from Corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, 171-176. Los Alamitos, CA: IEEE Computer Society Press.

Lehnert, W. 1991. Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. In Barnden, J., and Pollack, J., eds., Advances in Connectionist and Neural Computation Theory, Vol. 1, 135-164. Norwood, NJ: Ablex Publishers.

Marcus, M.; Santorini, B.; and Marcinkiewicz, M. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2):313-330.

MUC-4 Proceedings. 1992. Proceedings of the Fourth Message Understanding Conference (MUC-4). San Mateo, CA: Morgan Kaufmann.

Riloff, E. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 811-816. AAAI Press/The MIT Press.

Riloff, E. 1996. An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. Artificial Intelligence, Vol. 85. Forthcoming.

Riloff, E., and Shoen, J. 1995. Automatically Acquiring Conceptual Patterns Without an Annotated Corpus. In Proceedings of the Third Workshop on Very Large Corpora, 148-161.

Soderland, S.; Fisher, D.; Aseltine, J.; and Lehnert, W. 1995. CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1314-1319.