
AI Magazine Volume 18 Number 4 (1997) (© AAAI)


Empirical Methods in Information Extraction
Claire Cardie

■ This article surveys the use of empirical, machine-learning methods for a particular natural language–understanding task—information extraction. The author presents a generic architecture for information-extraction systems and then surveys the learning algorithms that have been developed to address the problems of accuracy, portability, and knowledge acquisition for each component of the architecture.

Most corpus-based methods in natural language processing (NLP) were developed to provide an arbitrary text-understanding application with one or more general-purpose linguistic capabilities, as evidenced by the articles in this issue of AI Magazine. Author Eugene Charniak and coauthors Hwee Tou Ng and John Zelle, for example, describe techniques for part-of-speech tagging, parsing, and word-sense disambiguation. These techniques were created with no specific domain or high-level language-processing task in mind. In contrast, my article surveys the use of empirical methods for a particular natural language–understanding task that is inherently domain specific. The task is information extraction. Generally, an information-extraction system takes as input an unrestricted text and "summarizes" the text with respect to a prespecified topic or domain of interest: It finds useful information about the domain and encodes the information in a structured form, suitable for populating databases. In contrast to in-depth natural language–understanding tasks, information-extraction systems effectively skim a text to find relevant sections and then focus only on these sections in subsequent processing. The information-extraction system in figure 1, for example, summarizes stories about natural disasters, extracting for each such event the type of disaster, the date and time that it occurred, and data on any property damage or human injury caused by the event.

Information extraction has figured prominently in the field of empirical NLP: The first large-scale, head-to-head evaluations of NLP systems on the same text-understanding tasks were the Defense Advanced Research Projects Agency–sponsored Message-Understanding Conference (MUC) performance evaluations of information-extraction systems (Chinchor, Hirschman, and Lewis 1993; Lehnert and Sundheim 1991). Prior to each evaluation, all participating sites receive a corpus of texts from a predefined domain as well as the corresponding answer keys to use for system development. The answer keys are manually encoded templates—much like that of figure 1—that capture all information from the corresponding source text that is relevant to the domain, as specified in a set of written guidelines. After a short development phase,1 the NLP systems are evaluated by comparing the summaries each produces with the summaries generated by human experts for the same test set of previously unseen texts. The comparison is performed using an automated scoring program that rates each system according to measures of recall and precision. Recall measures the amount of the relevant information that the NLP system correctly extracts from the test collection; precision measures the reliability of the information extracted:

    recall = (# correct slot fillers in output templates) / (# slot fillers in answer keys)

    precision = (# correct slot fillers in output templates) / (# slot fillers in output templates)
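To make the two measures concrete, here is a minimal Python sketch (not from the original article) that scores one output template against its answer key; the real MUC scorer is more elaborate, awarding partial credit and handling optional slots.

```python
def score_template(output_fillers, key_fillers):
    """Compute recall and precision over (slot, filler) pairs.

    output_fillers -- pairs produced by the extraction system
    key_fillers    -- pairs from the manually encoded answer key
    """
    correct = len(output_fillers & key_fillers)
    recall = correct / len(key_fillers) if key_fillers else 0.0
    precision = correct / len(output_fillers) if output_fillers else 0.0
    return recall, precision

output = {("Event", "tornado"), ("Time", "19:15"), ("Date", "4/4/97")}
key = {("Event", "tornado"), ("Time", "19:15"), ("Date", "4/3/97"),
       ("Injuries", "none")}
# The wrong date counts against both measures: recall = 2/4, precision = 2/3.
print(score_template(output, key))
```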


Free Text (input):

    4 Apr Dallas - Early last evening, a tornado swept through an area
    northwest of Dallas, causing extensive damage. Witnesses confirm that
    the twister occurred without warning at approximately 7:15 p.m. and
    destroyed two mobile homes. The Texaco station, at 102 Main Street,
    Farmers Branch, TX, was also severely damaged, but no injuries were
    reported. Total property damages are estimated to be $350,000.

Output Template (produced by the information-extraction system):

    Event: tornado
    Date: 4/3/97
    Time: 19:15
    Location: Farmers Branch : "northwest of Dallas" : TX : USA
    Damage: "mobile homes" (2)
            "Texaco station" (1)
    Estimated Losses: $350,000
    Injuries: none

Figure 1. An Information-Extraction System in the Domain of Natural Disasters.

As a result of MUC and other information-extraction efforts, information extraction has become an increasingly viable technology for real-world text-processing applications. For example, there are currently information-extraction systems that (1) support underwriters in analyzing life insurance applications (Glasgow et al. 1997); (2) summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments to assist health-care providers or support insurance processing (Soderland, Aronow, et al. 1995); (3) analyze news wires and transcripts of radio and television broadcasts to find and summarize descriptions of terrorist activities (MUC-4 1992; MUC-3 1991); (4) monitor technical articles describing microelectronic chip fabrication to capture information on chip sales, manufacturing advances, and the development or use of chip-processing technologies (MUC-5 1994); (5) analyze newspaper articles with the goal of finding and summarizing business joint ventures (MUC-5 1994); and (6) support the automatic classification of legal documents (Holowczak and Adam 1997).

A growing number of internet applications also use information-extraction technologies. Some examples include NLP systems that build knowledge bases directly from web pages (Craven et al. 1997); create job-listing databases from news groups, web sites, and classified advertisements (see www.junglee.com/success/index.html); build news group query systems (Thompson, Mooney, and Tang 1997); and create weather forecast databases from web pages (Soderland 1997).

Although the MUC evaluations have shown that it is possible to rigorously evaluate some aspects of an information-extraction system, it is difficult to state the overall performance levels of today's information-extraction systems:


At a minimum, performance depends on the relative complexity of the extraction task, the quality of the knowledge bases available to the NLP system, the syntactic and semantic complexity of the documents to be processed, and the regularity of the language in the documents. In general, however, the best extraction systems now can achieve levels of about 50-percent recall and 70-percent precision on fairly complex information-extraction tasks and can reach much higher levels of performance (approximately 90-percent recall and precision) for the easiest tasks. Although these levels of performance might not initially seem impressive, one should realize that information extraction is difficult for people as well as machines. Will's (1993) study, for example, showed that the best machine-extraction systems have an error rate that is only twice that of highly skilled analysts specifically trained in information-extraction tasks.

In spite of this recent progress, today's information-extraction systems still have problems: First, the accuracy and robustness of machine-extraction systems can be improved greatly. In particular, human error during information extraction is generally caused by a lapse of attention, but the errors of an automated extraction system are the result of its relatively shallow understanding of the input text. As a result, the machine-generated errors are more difficult to track down and correct. Second, building an information-extraction system in a new domain is difficult and time consuming, often requiring months of effort by domain specialists and computational linguists familiar with the underlying NLP system. Part of the problem lies in the domain-specific nature of the task: An information-extraction system will work better if its linguistic knowledge sources are tuned to the particular domain, but manually modifying and adding domain-specific linguistic knowledge to an existing NLP system is slow and error prone.

The remainder of the article surveys the empirical methods in NLP that have been developed to address these problems of accuracy, portability, and knowledge acquisition for information-extraction systems. Like the companion articles in this issue, we see that empirical methods for information extraction are corpus-based, machine-learning algorithms. To start, I present a generic architecture for information-extraction systems. Next, I provide examples of the empirical methods designed to increase the accuracy or the portability of each component in the extraction system. Throughout, I focus on the specific needs and constraints that information extraction places on the language-learning tasks.

The Architecture of an Information-Extraction System

In the early days of information extraction, NLP systems varied widely in their approach to the information-extraction task. At one end of the spectrum were systems that processed a text using traditional NLP techniques: (1) a full syntactic analysis of each sentence, (2) a semantic analysis of the resulting syntactic structures, and (3) a discourse-level analysis of the syntactic and semantic representations. At the other extreme lie systems that used keyword-matching techniques and little or no linguistic analysis of the input text. As more information-extraction systems were built and empirically evaluated, however, researchers began to converge on a standard architecture for information-extraction systems. This architecture is shown in figure 2. Although many variations exist from system to system, the figure indicates the main functions performed in an information-extraction system.

Each input text is first divided into sentences and words in a tokenization and tagging step. As indicated in figure 2, many systems also disambiguate, or tag, each word with respect to part of speech and, possibly, semantic class at this point during processing. The sentence-analysis phase follows. It comprises one or more stages of syntactic analysis, or parsing, that together identify noun groups, verb groups, prepositional phrases, and other simple constructs. In some systems, the parser also locates surface-level subjects and direct objects and identifies conjunctions, appositives, and other complex phrases. At some point, either before, during, or after the main steps of syntactic analysis, an information-extraction system also finds and labels semantic entities relevant to the extraction topic. In the natural disaster domain, for example, the system might identify locations, company names, person names, time expressions, and money expressions, saving each in a normalized form.

Figure 2 shows the syntactic constituents and semantic entities identified during sentence analysis for the first sentence of the sample text. There are important differences between the sentence-analysis stage of an information-extraction system and traditional parsers. Most importantly, the goal of syntactic analysis in an information-extraction system is not to produce a complete, detailed parse tree for each sentence in the text. Instead, the system need only perform partial parsing; that is, it need only construct as much structure as the information-extraction task requires.


Free text (input): "4 Apr Dallas - Early last evening, a tornado swept through an area northwest of Dallas, causing extensive damage. ..."

Tokenization and Tagging:
    Early/adv last/adj evening/noun/time ,/, a/det tornado/noun/weather swept/verb through/prep ...

Sentence Analysis:
    Early last evening      adverbial phrase : time
    a tornado               noun group / subject
    swept                   verb group
    through an area         prep phrase : location
    northwest of Dallas     adverbial phrase : location
    causing                 verb group
    extensive damage        noun group / object

Extraction:
    tornado swept -> Event: tornado
    tornado swept through an area -> Location: "area"
    area northwest of Dallas -> Location: "northwest of Dallas"
    causing extensive damage -> Damage

Merging (over the full text, "Early last evening, a tornado swept through an area northwest of Dallas... Witnesses confirmed that the twister ..."): coreferent phrases such as a tornado and the twister are linked.

Template Generation (output):
    Event: tornado
    Date: 4/3/97
    Time: 19:15
    ...

Figure 2. Architecture for an Information-Extraction System.
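The stages of figure 2 compose into a pipeline. The following toy sketch is meant only to make the data flow concrete; the stage implementations and the regular-expression pattern format are invented stand-ins for the components described in this section.

```python
import re

def tokenize_and_tag(text):
    # Stage 1: split into sentences and words (a real system would also
    # add part-of-speech and semantic-class tags here).
    return [s.split() for s in re.split(r"(?<=[.!?])\s+", text) if s]

def apply_patterns(sentences, patterns):
    # Extraction stage: match domain-specific patterns against each sentence.
    facts = []
    for words in sentences:
        sentence = " ".join(words)
        for regex, slot in patterns:
            for m in re.finditer(regex, sentence):
                facts.append((slot, m.group(1)))
    return facts

def generate_template(facts):
    # Merging + template generation, collapsed: keep the first filler per
    # slot (a stand-in for real coreference resolution and event merging).
    template = {}
    for slot, filler in facts:
        template.setdefault(slot, filler)
    return template

patterns = [(r"\b(tornado|twister|hurricane)\b", "Event"),
            (r"destroyed (.+?)[\.,]", "Damage")]
text = ("Early last evening, a tornado swept through an area northwest of "
        "Dallas. The twister destroyed two mobile homes.")
print(generate_template(apply_patterns(tokenize_and_tag(text), patterns)))
# {'Event': 'tornado', 'Damage': 'two mobile homes'}
```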

Unlike traditional full-sentence parsers, a partial parser looks for fragments of text that can reliably be recognized, for example, noun groups and verb groups. Because of its limited coverage, a partial parser can rely solely on general pattern-matching techniques—often finite-state machines—to identify these fragments deterministically based on purely local syntactic cues. Partial parsing is well suited for information-extraction applications for an additional reason: The ambiguity-resolution decisions that make full-blown parsing difficult can be postponed until later stages of processing where top-down expectations from the information-extraction task can guide the system's actions.
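A noun-group recognizer of this kind can be written as a single finite-state pattern over part-of-speech tags. The sketch below uses an invented toy tagset and one rule (optional determiner, any adjectives, one or more nouns); it illustrates the technique and is not any particular system's grammar.

```python
import re

# One finite-state rule for noun groups over a toy tagset: an optional
# determiner, any number of adjectives, then one or more nouns.
NOUN_GROUP = re.compile(r"(DET )?(ADJ )*(NOUN )+")

def noun_groups(tagged):
    """Return (start, end) token spans of noun groups in a tagged sentence.

    Assumes tags come from the toy tagset shown below (no tag is a
    substring of another)."""
    tags = "".join(tag + " " for _, tag in tagged)
    spans = []
    for m in NOUN_GROUP.finditer(tags):
        start = tags[:m.start()].count(" ")
        end = tags[:m.end()].count(" ")
        spans.append((start, end))
    return spans

sentence = [("a", "DET"), ("tornado", "NOUN"), ("swept", "VERB"),
            ("through", "PREP"), ("an", "DET"), ("area", "NOUN")]
print(noun_groups(sentence))   # [(0, 2), (4, 6)] -> "a tornado", "an area"
```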
The extraction phase is the first entirely domain-specific component of the system. During extraction, the system identifies domain-specific relations among relevant entities in the text. Given the first sentence in our example, this component should identify the type of natural disaster (tornado), the location of the event (area and northwest of Dallas, TX), and the fact that there was some property damage. Figure 2 shows the information extracted from the first sentence and those portions of text responsible for each piece of extracted data. Information for filling the remaining slots of the output template would similarly be extracted from subsequent sentences.

The main job of the merging phase is coreference resolution, or anaphora resolution: The system examines each entity encountered in the text and determines whether it refers to an existing entity or whether it is new and must be added to the system's discourse-level representation of the text. In the sample text, for example, the mention of a tornado in the first sentence indicates a new entity; the twister in sentence two, however, refers to the same entity as the tornado in sentence one. Recognizing when two statements refer to the same entity is critical for an information-extraction system because it allows the system to associate the information from both statements with the same object. In some systems, another task of merging is to determine the implicit subjects of all verb phrases. In sentence one, this component would infer that tornado is the subject of causing (as well as the subject of swept), allowing the system to directly associate damage with the tornado.

The discourse-level inferences made during merging aid the template-generation phase, which determines the number of distinct events in the text, maps the individually extracted pieces of information onto each event, and produces output templates. Purely domain-specific inferences can also occur during template generation.


In the MUC terrorism domain, for example, terrorist events involving only military targets were not considered relevant unless civilians were injured, or there was damage to civilian property. The template-generation phase is often the best place to apply this domain-specific constraint. In addition, some slots in the output template must be filled with terms chosen from a set of possibilities rather than a string from the input text. In the sample scenario, the Injuries and Event slots require such set fills. Still other slots (for example, Date, Time, Location) might require normalization of their fillers. Both of these subtasks are part of the template-generation phase.
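Both subtasks are mechanical once the strings have been extracted. A small sketch follows; the set-fill vocabulary and the normalization rule are invented for illustration.

```python
import re

# Map extracted strings onto a fixed set-fill vocabulary for the Event slot.
EVENT_SET_FILL = {"tornado": "tornado", "twister": "tornado",
                  "hurricane": "hurricane", "quake": "earthquake"}

def normalize_time(text):
    """Map strings such as 'approximately 7:15 p.m.' to 24-hour '19:15'."""
    m = re.search(r"(\d{1,2}):(\d{2})\s*(a\.m\.|p\.m\.)", text)
    if not m:
        return None
    hour, minute = int(m.group(1)), m.group(2)
    if m.group(3) == "p.m." and hour != 12:
        hour += 12
    return f"{hour:02d}:{minute}"

print(EVENT_SET_FILL["twister"])                  # tornado
print(normalize_time("approximately 7:15 p.m."))  # 19:15
```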
The Role of Corpus-Based–Language Learning Algorithms

With this architecture in mind, we can now return to our original question: How have researchers used empirical methods in NLP to improve the accuracy and portability of information-extraction systems? In general, corpus-based–language learning algorithms have been used to improve individual components of the information-extraction system and, as a result, to improve the end-to-end performance. In theory, empirical methods can be used for each subtask of information extraction: part-of-speech tagging, semantic-class tagging, word-sense disambiguation, named entity identification (for example, company names, person names, locations), partial parsing, extraction-pattern learning, coreference resolution, and each step of template generation. The catch, as is often the case in corpus-based approaches to language learning, is obtaining enough training data. As described in the overview article, supervised language learning algorithms acquire a particular language-processing skill by taking many examples of how to correctly perform the task and then generalizing from the examples to handle unseen cases. The algorithms, therefore, critically depend on the existence of a corpus that has been annotated with the appropriate supervisory information.

For language tasks that are primarily domain independent and syntactic in nature, annotated corpora such as the Penn tree bank (Marcus, Marcinkiewicz, and Santorini 1993) already exist and can be used to extract training data for the information-extraction system. Part-of-speech tagging and bracketing text into noun groups, verb groups, clauses, and so on, fall into this category. For these tasks, one can use the tree bank's Wall Street Journal corpus, which has been annotated with both word class and syntactic structure, together with any of a number of corpus-based algorithms: Some examples include using hidden Markov models (HMMs) for part-of-speech tagging and statistical learning techniques for parsing (see Charniak [1993] and Weischedel et al. [1993]), Brill's (1995) transformation-based learning for part-of-speech tagging and bracketing (Ramshaw and Marcus 1995), decision tree models for parsing (Magerman 1995), case-based learning for lexical tagging (Daelemans et al. 1996; Cardie 1993), and inductive logic programming for learning syntactic parsers (Zelle and Mooney 1994). The resulting taggers and bracketers will be effective across information-extraction tasks as long as the input to the information-extraction system uses a writing style and genre that is similar to the training corpus. Otherwise, a new training corpus must be created and used to completely retrain or bootstrap the training of the component. In theory, word-sense–disambiguation algorithms would also be portable across extraction tasks. However, defining standard word senses is difficult, and to date, text collections have been annotated according to these predefined senses only for a small number of selected words. In addition, the importance of word-sense disambiguation for information-extraction tasks remains unclear.

Natural language–learning techniques are more difficult to apply to subsequent stages of information extraction—namely, the learning of extraction patterns, coreference resolution, and template generation. There are a number of problems: First, there are usually no corpora annotated with the appropriate semantic and domain-specific supervisory information. The typical corpus for information-extraction tasks is a collection of texts and their associated answer keys, that is, the output templates that should be produced for each text. Thus, a new corpus must be created for each new information-extraction task. In addition, the corpus simply does not contain the supervisory information needed to train most components of an information-extraction system, including the lexical-tagging, coreference-resolution, and template-generation modules. The output templates are often inadequate even for learning extraction patterns: They indicate which strings should be extracted and how they should be labeled but say nothing about which occurrence of the string is responsible for the extraction when multiple occurrences appear in the text. Furthermore, they provide no direct means for learning patterns to extract set fills, symbols not necessarily appearing anywhere in the text. As a result, researchers create their own training corpora, but because this process is slow, the resulting corpora can be much smaller than is normally required for statistical approaches to language analysis.
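The occurrence ambiguity mentioned above is easy to make concrete in code: an answer key pairs a slot with a string, but the string may appear several times in the text, and the key does not say which occurrence licensed the extraction. A minimal illustration:

```python
import re

def candidate_occurrences(text, filler):
    """Start offsets of every occurrence of an answer-key filler."""
    return [m.start() for m in re.finditer(re.escape(filler), text)]

text = ("A tornado swept through an area northwest of Dallas. "
        "Officials said the tornado was the third tornado this year.")
print(candidate_occurrences(text, "tornado"))
# Three candidate positions; a pattern learner must decide (or guess)
# which occurrence the answer key refers to.
```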


Another problem is that the semantic and domain-specific language-processing skills needed for information extraction often require the output of earlier levels of analysis, for example, tagging and partial parsing. This requirement complicates the generation of training examples for the learning algorithm because there can be no standard corpus from which complete training examples can be read off, as is the case for part-of-speech tagging and parsing. The features that describe the learning problem depend on the information available to the extraction system in which the learning algorithm is embedded, and these features become available only after the training texts have passed through earlier stages of linguistic analysis. Whenever the behavior of these earlier modules changes, new training examples must be generated and the learning algorithms for later stages of the information-extraction system retrained. Furthermore, the learning algorithms must deal effectively with noise caused by errors from earlier components. The cumulative effect of these complications is that the learning algorithms used for low-level tagging or syntactic analysis might not readily apply to the acquisition of these higher-level language skills, and new algorithms often need to be developed.

In spite of the difficulties of applying empirical methods to problems in information extraction, it is precisely the data-driven nature of corpus-based approaches that allows them to simultaneously address both of the major problems for information-extraction systems—accuracy and portability. When the training data are derived from the same type of texts that the information-extraction system is to process, the acquired language skills are automatically tuned to that corpus, increasing the accuracy of the system. In addition, because each natural language–understanding skill is learned automatically rather than manually coded into the system, the skill can be moved quickly from one information-extraction system to another by retraining the appropriate component.

The remaining sections describe the natural language–learning techniques that have been developed for training the domain-dependent and semantic components of an information-extraction system: extraction, merging, and template generation. In each case, I describe how the previously discussed problems are addressed and summarize the state of the art in the field.

Learning Extraction Patterns

As in the sentence-analysis stages, general pattern-matching techniques have also become the technique of choice for the extraction phase of an information-extraction system (MUC-6 1995). The role for empirical methods in the extraction phase, therefore, is one of knowledge acquisition: to automate the acquisition of good extraction patterns, where good patterns are those that are general enough to extract the correct information from more than one sentence but specific enough to not apply in inappropriate contexts. A number of researchers have investigated the use of corpus-based methods for learning information-extraction patterns. The learning methods vary along a number of dimensions: the class of patterns learned, the training corpus required, the amount and type of human feedback required, the degree of preprocessing necessary, the background knowledge required, and the biases inherent in the learning algorithm itself.

One of the earliest systems for acquiring extraction patterns was AUTOSLOG (Riloff 1993; Lehnert et al. 1992). AUTOSLOG learns extraction patterns in the form of domain-specific concept-node definitions for use with the CIRCUS parser (Cardie and Lehnert 1991; Lehnert 1990). AUTOSLOG's concept nodes can be viewed as domain-specific semantic case frames that contain a maximum of one slot for each frame. Figure 3, for example, shows the concept node for extracting two mobile homes as damaged property from sentence two of the sample text. The first field in the concept node specifies the type of concept to be recognized (for example, Damage). The concept type generally corresponds to a specific slot in the output template (for example, the Damage slot of figure 1). The remaining fields in the concept node represent the extraction pattern. The trigger is the word that activates the pattern—it acts as the pattern's conceptual anchor point. The position denotes the syntactic position where the concept is expected to be found in the input sentence (for example, the direct object, subject, object of a preposition); the constraints are selectional restrictions that apply to any potential instance of the concept. In CIRCUS, these semantic constraints can be hard or soft: Hard constraints are predicates that must be satisfied before the phrase in the specified position can be extracted as an instance of the concept; soft constraints suggest preferences for slot fillers but do not inhibit the extraction of phrases if violated. In all our examples, we assume that the constraints are hard. Finally, the enabling conditions are constraints on the linguistic context of the triggering word that must be satisfied before the pattern is activated.
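The concept-node representation maps naturally onto a small data structure. A sketch follows; the field names track figure 3 (below), but the clause representation is an invented simplification of CIRCUS's actual output.

```python
from dataclasses import dataclass

@dataclass
class ConceptNode:
    concept: str                 # output-template slot, e.g., "Damage"
    trigger: str                 # word that activates the pattern
    position: str                # where the filler is found, e.g., "direct-object"
    constraints: tuple           # hard semantic restrictions on the filler
    enabling_conditions: tuple   # linguistic context required of the trigger

    def extract(self, clause, semantic_class_of):
        """Return the slot filler if the pattern applies to a parsed clause."""
        if clause["verb"] != self.trigger:
            return None
        if not all(c in clause["conditions"] for c in self.enabling_conditions):
            return None
        filler = clause.get(self.position)
        if filler and all(c in semantic_class_of(filler) for c in self.constraints):
            return {self.concept: filler}
        return None

damage = ConceptNode("Damage", "destroyed", "direct-object",
                     ("physical-object",), ("active-voice",))
clause = {"verb": "destroyed", "direct-object": "two mobile homes",
          "conditions": ("active-voice",)}
print(damage.extract(clause, lambda np: ("physical-object",)))
# {'Damage': 'two mobile homes'}
```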


The concept node of figure 3, for example, would be triggered by the word destroyed when used in the active voice. Once activated, the concept node would then extract the direct object of the clause to fill its Damage slot as long as the phrase denoted a physical object.

Sentence Two:
    "Witnesses confirm that the twister occurred without warning at
    approximately 7:15 p.m. and destroyed two mobile homes."

Concept-Node Definition:
    Concept = Damage
    Trigger = "destroyed"
    Position = direct-object
    Constraints = ((physical-object))
    Enabling Conditions = ((active-voice))

Instantiated Concept Node:
    Damage = "two mobile homes"

Figure 3. Concept Node for Extracting Damage Information.

Once the concept node is defined, it can be used in conjunction with a partial parser to extract information from novel input sentences. The system parses the sentence; if the trigger word is encountered and the enabling conditions satisfied, then the phrase found in the specified syntactic constituent is extracted, tested for the appropriate semantic constraints, and then labeled as an instance of the designated concept type. The bottom of figure 3 shows the concept extracted from sentence two of the sample text after applying the Damage concept node. Alternatively, given the sentence "the hurricane destroyed two office buildings," the same Damage concept node would extract two office buildings as the damaged entities. The extracted concepts are used during merging and template generation to produce the desired output templates.

AUTOSLOG learns concept-node definitions using a one-shot learning algorithm designed specifically for the information-extraction task. As a training corpus, it requires a set of texts and their answer keys.2 The AUTOSLOG learning algorithm is straightforward and depends only on the existence of a partial parser, a small lexicon with semantic-class information, and a small set (approximately 13) of general linguistic patterns that direct the creation of concept nodes. Given a noun phrase from an answer key, AUTOSLOG performs the following steps to derive a concept node for extracting the phrase from the original text:

First, find the sentence from which the noun phrase originated. For example, given the target noun phrase two mobile homes that fills the Damage slot, AUTOSLOG would return sentence two from the sample text during this step.

Second, present the sentence to the partial parser for processing. AUTOSLOG's partial parser must be able to identify the subject, direct object, verb group, and prepositional phrases of each clause. For sentence two of the sample text, the parser should determine, among other things, that destroyed occurred as the verb group of the third clause with two mobile homes as its direct object.

Third, apply the linguistic patterns in order. AUTOSLOG's linguistic patterns attempt to identify domain-specific thematic role information for a target noun phrase based on the syntactic position in which the noun phrase appears and the local linguistic context. The first pattern that applies determines the extraction pattern, that is, concept node, for extracting the noun phrase from the training sentence. The linguistic pattern that would apply in the two mobile homes example is

    <active-voice-verb> followed by <target-np> = <direct object> .

This pattern says that the noun phrase to be extracted, that is, the target-np, appeared as the direct object of an active voice verb. Similar patterns exist for the objects of passives and infinitives and for cases where the target noun phrase appears as the subject of a clause or the object of a prepositional phrase. AUTOSLOG's linguistic patterns are, for the most part, domain independent; they need little or no modification when moving an NLP system from one information-extraction task to another.

Fourth, when a pattern applies, generate a concept-node definition from the matched constituents, their context, the slot type for the target noun phrase, and the predefined semantic class for the filler. For our example, AUTOSLOG would generate a concept-node definition of the following form:

    Concept = < <slot type> of <target-np> >
    Trigger = "< <verb> of <active-voice-verb> >"
    Position = direct-object
    Constraints = ((< <semantic class> of <concept> >))
    Enabling Conditions = ((active-voice)) .
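Steps three and four amount to a table lookup followed by slot filling. The following sketch is a schematic rendering of that derivation, not the original implementation; the clause representation and the two-entry pattern table are simplified assumptions.

```python
# Each linguistic pattern: (applicability test, extraction position,
# enabling conditions for the resulting concept node).
LINGUISTIC_PATTERNS = [
    (lambda cl, np: cl["voice"] == "active" and cl.get("direct-object") == np,
     "direct-object", ("active-voice",)),
    (lambda cl, np: cl["voice"] == "passive" and cl.get("subject") == np,
     "subject", ("passive-voice",)),
]

def derive_concept_node(clause, target_np, slot_type, semantic_class):
    """One-shot derivation: instantiate the first pattern that applies."""
    for applies, position, enabling in LINGUISTIC_PATTERNS:
        if applies(clause, target_np):
            return {"Concept": slot_type,
                    "Trigger": clause["verb"],
                    "Position": position,
                    "Constraints": (semantic_class,),
                    "Enabling Conditions": enabling}
    return None  # no pattern applied; nothing is proposed

clause = {"subject": "the twister", "verb": "destroyed",
          "direct-object": "two mobile homes", "voice": "active"}
print(derive_concept_node(clause, "two mobile homes",
                          "Damage", "physical-object"))
```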


AUTOSLOG assumes that the semantic class of each concept type is given as part of the domain specification and that the parser has a mechanism for assigning these semantic classes to nouns and noun modifiers during sentence analysis. After substitutions, this concept-node definition will match the Damage concept node of figure 3.

Some examples of extraction patterns learned by AUTOSLOG for the terrorism domain include (in shorthand form): <victim> was murdered; <perpetrator> bombed; <perpetrator> attempted to kill; was aimed at <target>. In these examples, the bracketed items denote concept type and the word in boldface is the concept-node trigger. Although many of AUTOSLOG's learned patterns are good, some are too general (for example, they are triggered by is or are); others are too specific; still others are just wrong. These bad extraction patterns are sometimes caused by parsing errors; alternatively, they occur when target noun phrases occur in a prepositional phrase, and AUTOSLOG cannot determine whether the preceding verb or noun phrase should trigger the extraction. As a result, AUTOSLOG requires that a person review the proposed extraction patterns and discard those that seem troublesome.

AUTOSLOG has been used to automatically derive extraction patterns for a number of domains: terrorism, business joint ventures, and advances in microelectronics. In terms of improving the portability of information-extraction systems, AUTOSLOG allowed developers to create extraction patterns for the terrorism domain in 5 hours instead of the approximately 1200 hours required to create extraction patterns for the domain by hand. In terms of accuracy, there was no direct empirical evaluation of the learned patterns, even though some form of cross-validation could have been used. Instead, the learned patterns were evaluated indirectly—by using them to drive the University of Massachusetts–MUC-3 information-extraction system. In short, the AUTOSLOG-generated patterns achieved 98 percent of the performance of the handcrafted patterns. This result is especially impressive because the handcrafted patterns had placed the University of Massachusetts system among the top performers in the MUC-3 performance evaluation (Lehnert et al. 1991). AUTOSLOG offered an additional advantage over the handcrafted rule set: Because domain experts can review the automatically generated extraction patterns with minimal training, building the patterns no longer required the expertise of a computational linguist with a deep understanding of the underlying NLP system. This is a critical step toward building information-extraction systems that are trainable entirely by end users.

Figure 4 shows the general structure of corpus-based approaches to learning information-extraction patterns. AUTOSLOG conforms to this structure except for its human feedback loop, which does not inform the learning algorithm of its findings. Virtually all subsequent attempts to automate the acquisition of extraction patterns also conform to the general structure of figure 4. In the next paragraphs, I describe a handful of these systems.

    Annotated Corpus or Texts w/Answer Keys  +  Background Knowledge
        |  (target phrases + sentences that contain them)
        v
    Sentence Analysis or Preprocessing
        |
        v
    Training/Test Case Generation
        |
        v
    Learning Algorithm  <--(feedback loop)
        |
        v
    extraction patterns

Figure 4. Learning Information-Extraction Patterns.

First, Kim and Moldovan's (1995) PALKA system learns extraction patterns that are similar in form to AUTOSLOG's concept nodes. The approach used to generate the patterns, however, is quite different. The background knowledge is not a set of linguistic patterns to be instantiated but a concept hierarchy, a set of predefined keywords that can be used to trigger each pattern, and a semantic-class lexicon. The concept hierarchy contains generic semantic case-frame definitions for each type of information to be extracted. To learn extraction patterns, PALKA looks for sentences that contain case-frame slots using semantic-class information.


The training corpus is used to choose the correct mapping when more than one is possible, but the concept and semantic-class hierarchies guide PALKA's generalization and specialization of proposed patterns.

Like AUTOSLOG and PALKA, CRYSTAL (Soderland et al. 1995) learns extraction patterns in the form of semantic case frames. CRYSTAL's patterns, however, can be more complicated. Instead of specifying a single trigger word and its local linguistic context, the triggers for CRYSTAL's patterns comprise a much more detailed specification of linguistic context. In particular, the target constituent or any surrounding constituents (for example, the subject, verb, or object of the current clause) can be tested for a specific sequence of words or the presence of heads or modifiers with the appropriate semantic class. CRYSTAL uses a covering algorithm to learn extraction patterns and their relatively complicated triggering constraints. Covering algorithms are a class of inductive learning technique that successively generalizes input examples until the generalization produces errors. As a result, CRYSTAL begins by generating the most specific concept node possible for every phrase to be extracted in the training texts. It then progresses through the concept nodes one by one. For each concept node, C, CRYSTAL finds the most similar concept node, C', and relaxes the constraints of each just enough to unify C and C'. The new extraction pattern, P, is tested against the training corpus. If its error rate is less than some prespecified threshold, P is added to the set, replacing C and C'. The process is repeated on P until the error tolerance is exceeded. At this point, CRYSTAL moves on to the next pattern in the original set. CRYSTAL was initially used to derive extraction patterns for a medical diagnosis domain, where it achieved precision levels ranging from 50 percent to 80 percent and recall levels ranging from 45 percent to 75 percent, depending on how the error-tolerance threshold was set.
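The covering loop itself is compact. In the sketch below, a pattern is reduced to a frozenset of required (feature, value) constraints so that "relaxing just enough to unify" becomes set intersection; this is a toy stand-in for CRYSTAL's much richer linguistic contexts.

```python
def unify(p, q):
    # Relax both patterns just enough to cover each other:
    # keep only the constraints they share.
    return p & q

def error_rate(pattern, negatives):
    # Fraction of negative contexts the pattern would (wrongly) match.
    if not negatives:
        return 0.0
    return sum(pattern <= ex for ex in negatives) / len(negatives)

def crystal_cover(seeds, negatives, tolerance=0.0):
    """Schematic CRYSTAL: generalize maximally specific patterns pairwise."""
    patterns = set(seeds)
    worklist = list(seeds)
    while worklist:
        c = worklist.pop()
        if c not in patterns:
            continue
        while len(patterns) > 1:
            c2 = max(patterns - {c}, key=lambda p: len(p & c))  # most similar
            p = unify(c, c2)
            if error_rate(p, negatives) > tolerance:
                break  # error tolerance exceeded; move to the next pattern
            patterns -= {c, c2}
            patterns.add(p)
            c = p
    return patterns

seeds = [frozenset({("verb", "destroyed"), ("dobj-class", "physical-object")}),
         frozenset({("verb", "damaged"), ("dobj-class", "physical-object")})]
negatives = [frozenset({("verb", "said"), ("dobj-class", "person")})]
print(crystal_cover(seeds, negatives))
# One generalized pattern survives: any clause whose direct object
# is a physical object.
```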
Although AUTOSLOG, PALKA, and CRYSTAL learn extraction patterns in the form of semantic case frames, each uses a different learning strategy. AUTOSLOG creates extraction patterns by specializing a small set of general linguistic patterns, CRYSTAL generalizes complex but maximally specific linguistic contexts, and PALKA performs both generalization and specialization of an initial extraction pattern. Where AUTOSLOG makes no attempt to limit the number of extraction patterns created, CRYSTAL's covering algorithm derives the minimum number of patterns that cover the examples in the training corpus. In addition, CRYSTAL and PALKA use automated feedback for the learning algorithm; AUTOSLOG requires human perusal of proposed patterns. CRYSTAL and PALKA, however, require more background knowledge in the form of a possibly domain-specific semantic-class hierarchy, a lexicon that indicates semantic-class information for each word and, in the case of PALKA, a set of trigger words. The parsers of both systems must also be able to accurately assign semantic-class information to words in an incoming text. No semantic hierarchy is needed for AUTOSLOG—a flat semantic feature list will suffice. Also, although AUTOSLOG's patterns perform best when semantic-class information is available, the learning algorithm and the resulting concept nodes can still operate effectively when no semantic-class information can be obtained.

There have been a few additional attempts to learn extraction patterns. Huffman's (1996) LIEP system learns patterns that recognize semantic relationships between two target noun phrases, that is, between two slot fillers of an information-extraction output template. The patterns describe the syntactic context that falls between the target noun phrases as well as the semantic class of the heads of the target phrases and all intervening phrases. I (Cardie 1993) used standard symbolic machine-learning algorithms (decision tree induction and a k nearest-neighbor algorithm) to identify the trigger word for an extraction pattern, the general linguistic context in which the pattern would be applied, and the type of concept that the pattern would identify. Califf and Mooney (1997) have recently applied relational learning techniques to acquire extraction patterns from news group job postings. Like CRYSTAL, their RAPIER system operates by generalizing an initial set of specific patterns. Unlike any of the previously mentioned systems, however, RAPIER learns patterns that specify constraints at the word level rather than the constituent level. As a result, only a part-of-speech tagger is required to process input texts.

Although much progress has been made on learning extraction patterns, many research issues still need to be resolved. Existing methods work well when the information to be extracted is explicitly denoted as a string in the text, but major extensions would be required to handle set fills. Furthermore, existing methods focus on the extraction of noun phrases. It is not clear that the same methods would work well for domains in which the extracted information is another syntactic type or is a component of a constituent rather than a complete constituent (for example, a group of noun modifiers in a noun phrase).


Finally, few of the methods described here have been evaluated on the same information-extraction tasks under the same conditions. Until a direct comparison of techniques is available, it will remain difficult to determine the relative advantages of one technique over another. A related open problem in the area is to determine, a priori, which method for learning extraction patterns will give the best results in a new extraction domain.

Coreference Resolution and Template Generation

In comparison to empirical methods for learning extraction patterns, substantially less research has tackled the problems of coreference resolution and template generation. As mentioned earlier, the goal of the coreference component is to determine when two phrases refer to the same entity. Although this task might not appear difficult, consider the following text from the MUC-6 (1995) corporate management succession domain. In this text, all the bracketed segments are coreferential:

    [Motor Vehicles International Corp.] announced a major management
    shake-up.... [MVI] said the chief executive officer has resigned....
    [The Big 10 auto maker] is attempting to regain market share.... [It]
    will announce significant losses for the fourth quarter.... A [company]
    spokesman said [they] are moving [their] operations to Mexico in a
    cost-saving effort.... [MVI, [the first company to announce such a move
    since the passage of the new international trade agreement],] is facing
    increasing demands from unionized workers.... [Motor Vehicles
    International] is [the biggest American auto exporter to Latin America].

The passage shows the wide range of linguistic phenomena that influence coreference resolution, including proper names, aliases, definite noun phrases, definite descriptions, pronouns, predicate nominals, and appositives. Unfortunately, different factors can play a role in handling each type of reference. In fact, discourse processing, and coreference in particular, has been cited as a major weakness of existing information-extraction systems. One problem is that most systems use manually generated heuristics to determine when two phrases describe the same entity, but generating good heuristics that cover all types of reference resolution is challenging. In particular, few discourse theories have been evaluated empirically, and as a result, it is not clear what information to include in the heuristics. It is also difficult to design heuristics that combine multiple coreference cues effectively, given that the relative importance of each piece of information is unknown. Furthermore, most computational approaches to coreference resolution assume as input fully parsed sentences, often marked with additional linguistic attributes such as grammatical function and thematic role information. Information-extraction systems do not normally have such detailed parse information available: The robust partial parsing algorithms used by most information-extraction systems offer wider coverage in exchange for less syntactic information. A further complication in developing trainable coreference components for an information-extraction system is that discourse analysis is based on information discerned by earlier phases of processing. Thus, any coreference algorithm must take into account the accumulated errors of the earlier phases as well as the fact that some information that would aid the coreference task might be missing. Finally, the coreference component of an information-extraction system must be able to handle the myriad forms of coreference across different domains.

Empirical methods for coreference were designed to address these problems. Unlike the methods for learning extraction patterns, algorithms for building automatically trainable coreference-resolution systems have not required the development of learning algorithms designed specifically for the task. By recasting the coreference problem as a classification task, any of a number of standard inductive-learning algorithms can be used. Given two phrases and the context in which they occur, for example, the coreference-learning algorithm must classify the phrases with respect to whether they refer to the same object. Here, I describe two systems that use inductive-classification techniques to automatically acquire coreference-resolution heuristics: MLR (machine-learning–based resolver) (Aone and Bennett 1995) and RESOLVE (McCarthy and Lehnert 1995).

Both MLR and RESOLVE use the same general approach, which is depicted in figure 5. First, a training corpus is annotated with coreference information; namely, all phrases that refer to the same object are linked using the annotations. Alternatively, just the best (usually the most recent) antecedent for each referent is marked. Training examples for presentation to the machine-learning algorithm are then created from the corpus.


Training Phase:
    Create Corpus: for each text, link phrases that refer to the same
    objects (for example, "Early last evening, a tornado swept through an
    area northwest of Dallas... the twister ...").
    Generate Training Data: create one instance for each pair of referents
    in the training texts:
        phrase-1 phrase-2 context : coref
        phrase-1 phrase-3 context : coref
        phrase-1 phrase-4 context : not-coref
        phrase-2 phrase-3 context : not-coref
        phrase-2 phrase-4 context : not-coref
        ...
    Train: the instances are given to an inductive learning algorithm,
    which produces a concept description (shown in the figure as a decision
    tree whose leaves are labeled coref and not-coref).

Application Phase:
    The IE system processes a novel text and builds a test instance
    (phrase-1 phrase-2 context : ?) for each coreference decision; the
    learned concept description returns the coreference decision.

Figure 5. A Machine-Learning Approach to Coreference Resolution.

There will be one instance for every possible pairing of referents in the training texts: Some of these are positive examples in that they correspond to phrases that are coreferent; others are negative examples in that they correspond to phrases that are not referring to the same object. The exact form of the instances depends on the learning algorithm, but for the inductive-learning algorithm used by MLR and RESOLVE, the training examples contain (1) a list of features, or attribute-value pairs, that describe the phrases under consideration and the context in which they occur and (2) supervisory information in the form of a class value that indicates whether the two phrases are coreferent. The specific features used depend on the kinds of information available to the information-extraction system when the coreference decision must be made. More details on the creation of training data are given later.

Once the data set has been derived from the corpus, it is presented to the machine-learning algorithm, which uses the examples to derive a concept description for the coreference-resolution task. Figure 5 shows a concept description in the form of a decision tree, but the actual form depends on the particular learning algorithm employed. The idea is that after training, this concept description can be used to decide whether two phrases in an unseen text refer to the same object. This process is shown as the application phase in figure 5. The information-extraction system processes a new text and reaches a point where coreference decisions must be made. For each such decision, the NLP system creates a test instance. The test instance uses the same feature set as the training instances: Its features describe a discourse entity, its possible antecedent, and their shared context. The test instance is given to the learned concept description for classification as either coreferent or not, and the decision is returned to the information-extraction system.

Both MLR and RESOLVE use this general method for automatically constructing coreference components for their information-extraction systems. Both use the widely available C4.5 decision-tree–induction system (Quinlan 1992) as the inductive-learning component. There are, however, a number of differences in how each system instantiated and evaluated the general approach of figure 5. McCarthy tested RESOLVE on the MUC-5 business joint-venture corpus (English version), and Aone and Bennett tested MLR on the Japanese corpus for the same information-extraction domain. The evaluation of MLR focused on anaphors involving entities tagged as organizations (for example, companies, governments) by the sentence-analysis phase of their information-extraction system. The evaluation of RESOLVE focused more specifically on organizations that had been identified as a party in a joint venture by the extraction component.
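The instance-creation step is easy to state in code. The sketch below generates one instance per pair of phrases and trains scikit-learn's decision tree as a modern stand-in for C4.5; the four features are invented simplifications of the MLR and RESOLVE feature sets.

```python
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier

def features(p, q):
    return [int(p["head"] == q["head"]),                          # same head noun
            int(p["text"] in q["text"] or q["text"] in p["text"]),  # substring
            int(p["class"] == q["class"]),                        # same semantic class
            q["position"] - p["position"]]                        # distance apart

def make_instances(phrases):
    """One training instance per phrase pair; the class value says whether
    the annotations put the two phrases in the same coreference chain."""
    X, y = [], []
    for p, q in combinations(phrases, 2):
        X.append(features(p, q))
        y.append(int(p["chain"] == q["chain"]))
    return X, y

phrases = [
    {"text": "a tornado", "head": "tornado", "class": "weather",
     "position": 0, "chain": 1},
    {"text": "the twister", "head": "twister", "class": "weather",
     "position": 1, "chain": 1},
    {"text": "two mobile homes", "head": "homes", "class": "object",
     "position": 2, "chain": 2},
]
X, y = make_instances(phrases)
classifier = DecisionTreeClassifier().fit(X, y)  # stand-in for C4.5
```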


Both systems used feature representations that relied only on information that earlier phases of analysis could provide. However, MLR's data set was generated automatically by its information-extraction system, while RESOLVE was evaluated using a manually generated, noise-free data set. In addition, the feature sets of each system varied markedly. MLR's training and test instances were described in terms of 66 features that describe (1) lexical features of each phrase (for example, whether one phrase contains a character subsequence of the other), (2) the grammatical role of the phrases, (3) semantic-class information, and (4) relative positional information. Although all attributes of the MLR representation are domain independent, the values for some attributes can be domain specific.

RESOLVE's instance representation, on the other hand, contains a number of features that are unabashedly domain specific. Its representation includes eight features including whether each phrase contains a proper name (two features), whether one or both phrases refer to the entity formed by a joint venture (three features), whether one phrase contains an alias of the other (one feature), whether the phrases have the same base noun phrase (one feature), and whether the phrases originate from the same sentence (one feature). Note that a number of RESOLVE's features correspond to those used in MLR, for example, the alias feature of RESOLVE versus the character subsequence feature of MLR.

RESOLVE and MLR were evaluated using data sets derived from 50 and 250 texts, respectively. RESOLVE achieved recall and precision levels of 80 percent to 85 percent and 87 percent to 92 percent (depending on whether the decision tree was pruned). A baseline system that always assumed that the candidate phrases were not coreferent would also achieve relatively high scores given that negative examples made up 74 percent of the RESOLVE data set. MLR, however, achieved recall and precision levels of 67 percent to 70 percent and 83 percent to 88 percent (depending on the parameter settings of the training configuration). For both MLR and RESOLVE, recall and precision are measured with respect to the coreference task only, not the full information-extraction task.

Without additional experiments, it is impossible to know whether the differences in results depend on the language (English versus Japanese), the slight variations in training-testing methodology, the degree of noise in the data, or the feature sets used. Interestingly, MLR does well because a single feature—the character subsequence feature—can reliably predict coreference for phrases that are proper names for organizations, which make up almost half of the instances in the data set. Performance on non–proper-name organization referents was much lower. Definite noun phrases, for example, reached only 44-percent recall and 60-percent precision. Nevertheless, an important result for both MLR and RESOLVE was that each significantly outperformed a coreference system that had been developed manually for their information-extraction systems.

In a subsequent evaluation, the RESOLVE system competed in the MUC-6 coreference competition where it achieved scores of 41-percent to 44-percent recall and 51-percent to 59-percent precision after training on only 25 texts. This result was somewhat below the five best systems, which achieved 51-percent to 63-percent recall and 62-percent to 72-percent precision. All the better-performing systems, however, used manually encoded coreference algorithms. Like some of the manually coded systems, RESOLVE only attempted to resolve references to people and organizations. In fact, it was estimated that a good proper name–alias recognizer would have produced a coreference system with relatively good performance—about 30-percent recall and, possibly, 90-percent precision. One should note, however, that the interannotator agreement for marking coreference in 17 articles was found to be 80-percent recall and 82-percent precision, with definite descriptions (for example, [MVI, [the first company to announce such a move since the passage of the new international trade agreement]]) and bare nominals (for example, "A [company] spokesman") accounting for most of the discrepancies.

Overall, the results for coreference resolution are promising. They show that it is possible to develop automatically trainable coreference systems that can compete favorably with manually designed systems. In addition, they show that specially designed learning algorithms need not be developed because standard machine-learning algorithms might be up to the challenge. There is an additional advantage to applying symbolic machine-learning techniques to problems in natural language understanding: They offer a mechanism for evaluating the usefulness of different knowledge sources for any task in an NLP system that can be described as a classification problem. Examination of the coreference decision trees created by C4.5, for example, will indicate which knowledge sources are more important for the task: The knowledge source corresponding to a feature tested at node i in the tree is probably more important than the knowledge sources corresponding to the features tested below it in the tree. Furthermore, once the data set is created, it is a simple task to run multiple variations of the learning algorithm, giving each variation access to a different subset of features. As a result, empirical methods offer data-driven feedback for linguistic theories and system developers alike.
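That kind of feature-set experimentation takes only a few lines once the instances exist. A sketch, reusing the toy feature names from the earlier coreference listing and assuming instance lists of realistic size (the three-pair toy above is too small for three-fold cross-validation):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

FEATURE_NAMES = ["same_head", "substring", "same_class", "distance"]

def feature_ablation(X, y, cv=3):
    """Retrain with each feature withheld and report the accuracy change."""
    baseline = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv).mean()
    for i, name in enumerate(FEATURE_NAMES):
        reduced = [row[:i] + row[i + 1:] for row in X]
        score = cross_val_score(DecisionTreeClassifier(), reduced, y, cv=cv).mean()
        print(f"without {name}: {score - baseline:+.3f} accuracy")
```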


corresponding to a feature tested at node i in systems that used manually generated merging
the tree is probably more important than the and template-generation subsystems. Addi-
knowledge sources corresponding to the fea- tional research is needed to determine the fea-
tures tested below it in the tree. Furthermore, sibility of an entirely trainable discourse com-
once the data set is created, it is a simple task ponent. Finally, statistical approaches to
to run multiple variations of the learning algo- template merging are also beginning to sur-
rithm, giving each variation access to a differ- face. Kehler (1997), for example, introduced a
ent subset of features. As a result, empirical method for assigning a probability distribution
methods offer data-driven feedback for linguis- to coreference relationships, as encoded in
tic theories and system developers alike. competing sets of output templates. His initial
Still, much research remains to be done. The experiments indicate that the method com-
machine-learning approach to coreference pares favorably with the greedy approach to
should be tested on additional types of template merging that is used in SRI Interna-
anaphor using a variety of feature sets, includ- tional’s FASTUS information-extraction system
ing feature sets that require no domain-specific (Appelt et al. 1995).
information. In addition, if the approach is to
offer a general, task-independent solution to
the coreference problem, then the role of do-
Future Directions
main-specific information for coreference res- Research in information extraction is new. Re-
olution must be determined empirically, and search in applying learning algorithms to
the methods must be evaluated outside the problems in information extraction is even Research in
context of information extraction. The relative newer: We are only beginning to understand information
effect of errors from the preceding phases of the techniques for automatically acquiring
text analysis on learning algorithm perfor- both domain-independent and domain-de-
extraction
mance must also be investigated. pendent knowledge for these task-driven sys- is new.
There have been few attempts to use empirical methods for other discourse-level problems that arise in information extraction. BBN Systems and Technologies has developed a probabilistic method for determining paragraph relevance for its information-extraction system (Weischedel et al. 1993); it then uses the device to control the recall-precision trade-off.
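The article does not reproduce BBN's model, so the toy sketch below, with invented relevance probabilities, shows only the mechanism just described: paragraphs are filtered by a probability threshold, and raising the threshold trades recall for precision.

```python
# Toy illustration (not BBN's model): thresholding paragraph-relevance
# probabilities to move along the recall-precision trade-off.
def recall_precision(paragraphs, threshold):
    """paragraphs: list of (relevance_probability, is_truly_relevant)."""
    kept = [p for p in paragraphs if p[0] >= threshold]
    truly_relevant = sum(1 for _, rel in paragraphs if rel)
    kept_relevant = sum(1 for _, rel in kept if rel)
    recall = kept_relevant / truly_relevant if truly_relevant else 0.0
    precision = kept_relevant / len(kept) if kept else 0.0
    return recall, precision

# Invented scores from a hypothetical relevance model.
paras = [(0.95, True), (0.80, True), (0.60, False),
         (0.40, True), (0.20, False), (0.10, False)]
for t in (0.1, 0.5, 0.9):
    r, p = recall_precision(paras, t)
    print(f"threshold={t:.1f}  recall={r:.2f}  precision={p:.2f}")
```

A low threshold keeps nearly everything (high recall, lower precision); a high threshold keeps only the most confident paragraphs (high precision, lower recall).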
I have used symbolic machine-learning techniques to learn relative pronoun disambiguation heuristics (Cardie 1992a, 1992b). Thus, the information-extraction system can process a sentence such as "Castellar was kidnapped by members of the ELN, who attacked the mayor in his office" and infer that members of the ELN is the actor of the kidnapping as well as the implicit actor of the attack in the second clause.
as the implicit actor of the attack in the second intervention by NLP system developers. Many
clause. Two trainable systems that simultane- new learning methods will be needed to suc-
ously tackle merging and template generation ceed in this task, not the least of which are
have also been developed: TTG (Dolan et al. techniques that make direct use of the answer
1991) and WRAP-UP (Soderland and Lehnert keys of an information-extraction training cor-
1994). Both systems generate a series of deci- pus to automatically tune every component of
sion trees, each of which handles some piece the extraction system for a new domain. Final-
of the template-generation or merging tasks, ly, the robustness and generality of current
for example, deciding whether to merge two learning algorithms should be investigated
templates into one or deciding when to split and extended by broadening the definition of
an existing template into two or more tem- information extraction to include the extrac-
plates. WRAP-UP used 91 decision trees to make tion of temporal, causal, or other complex re-
these decisions for the MUC-5 microelectron- lationships among events. The demand for in-
ics domain based on features of the entities ex- formation-extraction systems in industry,
tracted from each clause in an input text. Un- government, and education and for personal
fortunately, the information-extraction use is spiraling as more and more text becomes
systems that used these trainable discourse available online. The challenge for empirical
components did not perform nearly as well as methods in NLP is to continue to match this
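Neither FASTUS's merging procedure nor Kehler's model is spelled out in the article; the toy sketch below, with an invented slot-overlap scorer standing in for a learned coreference probability, shows the general shape of greedy template merging:

```python
# Toy sketch of greedy template merging (invented; reproduces neither
# FASTUS's merger nor Kehler's probabilistic model).
def same_event_prob(t1, t2):
    # Stand-in scorer: fraction of shared, nonempty slots whose fillers agree.
    shared = [s for s in t1 if s in t2 and t1[s] and t2[s]]
    if not shared:
        return 0.5  # no evidence either way
    return sum(1 for s in shared if t1[s] == t2[s]) / len(shared)

def greedy_merge(templates, threshold=0.75):
    merged = []
    for t in templates:
        for m in merged:
            if same_event_prob(t, m) >= threshold:
                m.update({s: v for s, v in t.items() if v})  # copy nonempty fillers
                break
        else:
            merged.append(dict(t))
    return merged

reports = [
    {"event": "kidnapping", "victim": "Castellar", "perpetrator": ""},
    {"event": "kidnapping", "victim": "Castellar", "perpetrator": "ELN"},
    {"event": "attack",     "victim": "the mayor", "perpetrator": "ELN"},
]
print(greedy_merge(reports))  # -> one kidnapping template, one attack template
```

A probabilistic treatment along Kehler's lines would instead compare entire competing sets of output templates under the induced distribution rather than committing to each merge greedily.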

Future Directions

Research in information extraction is new. Research in applying learning algorithms to problems in information extraction is even newer: We are only beginning to understand the techniques for automatically acquiring both domain-independent and domain-dependent knowledge for these task-driven systems. As a result, the field can take any number of exciting directions. First, like the trends in statistical language learning, a next step would be to explore unsupervised learning algorithms as a means for sidestepping the lack of large, annotated corpora for information-extraction tasks. In general, there is a dearth of learning algorithms that deal effectively with the relatively small amounts of data available to developers of information-extraction systems. A related but slightly different direction of research is to focus on developing techniques that allow end users to quickly train information-extraction systems for their own needs through interaction with the system over time, completely eliminating the need for intervention by NLP system developers. Many new learning methods will be needed to succeed in this task, not the least of which are techniques that make direct use of the answer keys of an information-extraction training corpus to automatically tune every component of the extraction system for a new domain. Finally, the robustness and generality of current learning algorithms should be investigated and extended by broadening the definition of information extraction to include the extraction of temporal, causal, or other complex relationships among events. The demand for information-extraction systems in industry, government, and education and for personal use is spiraling as more and more text becomes available online.
The challenge for empirical methods in NLP is to continue to match this demand by developing additional natural language–learning techniques that replace manual coding efforts with automatically trainable components and that make it increasingly faster and easier to build accurate and robust information-extraction systems in new domains.

Acknowledgments

Preparation of this article was supported in part by National Science Foundation CAREER Award IRI–9624639.

Notes

1. The development phase has varied from year to year but has ranged from about one to nine months.

2. A newer version of AUTOSLOG requires only that individual texts are marked as relevant or irrelevant to the domain (Riloff 1996). The learned concept nodes are then labeled according to type by hand.

References

Aone, C., and Bennett, W. 1995. Evaluating Automated and Manual Acquisition of Anaphora Resolution Strategies. In Proceedings of the Thirty-Third Annual Meeting of the Association for Computational Linguistics, 122–129. Somerset, N.J.: Association for Computational Linguistics.

Appelt, D. E.; Hobbs, J. R.; Bear, J.; Israel, D.; Kameyama, M.; Kehler, A.; Martin, D.; Myers, K.; and Tyson, M. 1995. SRI International FASTUS System: MUC-6 Test Results and Analysis. In Proceedings of the Sixth Message-Understanding Conference (MUC-6), 237–248. San Francisco, Calif.: Morgan Kaufmann.

Brill, E. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics 21(4): 543–565.

Califf, M. E., and Mooney, R. J. 1997. Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the ACL Workshop on Natural Language Learning, 9–15. Somerset, N.J.: Association for Computational Linguistics.

Cardie, C. 1993. A Case-Based Approach to Knowledge Acquisition for Domain-Specific Sentence Analysis. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 798–803. Menlo Park, Calif.: American Association for Artificial Intelligence.

Cardie, C. 1992a. Corpus-Based Acquisition of Relative Pronoun Disambiguation Heuristics. In Proceedings of the Thirtieth Annual Meeting of the ACL, 216–223. Somerset, N.J.: Association for Computational Linguistics.

Cardie, C. 1992b. Learning to Disambiguate Relative Pronouns. In Proceedings of the Tenth National Conference on Artificial Intelligence, 38–43. Menlo Park, Calif.: American Association for Artificial Intelligence.

Cardie, C., and Lehnert, W. 1991. A Cognitively Plausible Approach to Understanding Complicated Syntax. In Proceedings of the Ninth National Conference on Artificial Intelligence, 117–124. Menlo Park, Calif.: American Association for Artificial Intelligence.

Charniak, E. 1993. Statistical Language Learning. Cambridge, Mass.: MIT Press.

Chinchor, N.; Hirschman, L.; and Lewis, D. 1993. Evaluating Message-Understanding Systems: An Analysis of the Third Message-Understanding Conference (MUC-3). Computational Linguistics 19(3): 409–449.

Craven, M.; Freitag, D.; McCallum, A.; Mitchell, T.; Nigam, K.; and Quek, C. Y. 1997. Learning to Extract Symbolic Knowledge from the World Wide Web, Internal report, School of Computer Science, Carnegie Mellon University.

Daelemans, W.; Zavrel, J.; Berck, P.; and Gillis, S. 1996. MBT: A Memory-Based Part-of-Speech Tagger-Generator. In Proceedings of the Fourth Workshop on Very Large Corpora, eds. E. Ejerhed and I. Dagan, 14–27. Copenhagen: ACL SIGDAT.

Dolan, C.; Goldman, S.; Cuda, T.; and Nakamura, A. 1991. Hughes Trainable Text Skimmer: Description of the TTS System as Used for MUC-3. In Proceedings of the Third Message-Understanding Conference (MUC-3), 155–162. San Francisco, Calif.: Morgan Kaufmann.

Glasgow, B.; Mandell, A.; Binney, D.; Ghemri, L.; and Fisher, D. 1997. MITA: An Information-Extraction Approach to Analysis of Free-Form Text in Life Insurance Applications. In Proceedings of the Ninth Conference on Innovative Applications of Artificial Intelligence, 992–999. Menlo Park, Calif.: American Association for Artificial Intelligence.

Holowczak, R. D., and Adam, N. R. 1997. Information Extraction–Based Multiple-Category Document Classification for the Global Legal Information Network. In Proceedings of the Ninth Conference on Innovative Applications of Artificial Intelligence, 1013–1018. Menlo Park, Calif.: American Association for Artificial Intelligence.

Huffman, S. 1996. Learning Information-Extraction Patterns from Examples. In Symbolic, Connectionist, and Statistical Approaches to Learning for Natural Language Processing, eds. S. Wermter, E. Riloff, and G. Scheler, 246–260. Lecture Notes in Artificial Intelligence Series. New York: Springer.

Kehler, A. 1997. Probabilistic Coreference in Information Extraction. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, eds. C. Cardie and R. Weischedel, 163–173. Somerset, N.J.: Association for Computational Linguistics.

Kim, J.-T., and Moldovan, D. I. 1995. Acquisition of Linguistic Patterns for Knowledge-Based Information Extraction. IEEE Transactions on Knowledge and Data Engineering 7(5): 713–724.

Lehnert, W. 1990. Symbolic-Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. In Advances in Connectionist and Neural Computation Theory, eds. J. Barnden and J. Pollack, 135–164. Norwood, N.J.: Ablex.

Lehnert, W., and Sundheim, B. 1991. A Performance Evaluation of Text Analysis Technologies. AI Magazine 12(3): 81–94.
Lehnert, W.; Cardie, C.; Fisher, D.; Riloff, E.; and Williams, R. 1991. University of Massachusetts: Description of the CIRCUS System as Used in MUC-3. In Proceedings of the Third Message-Understanding Conference (MUC-3), 223–233. San Francisco, Calif.: Morgan Kaufmann.

Lehnert, W.; Cardie, C.; Fisher, D.; McCarthy, J.; Riloff, E.; and Soderland, S. 1992. University of Massachusetts: Description of the CIRCUS System as Used in MUC-4. In Proceedings of the Fourth Message-Understanding Conference (MUC-4), 282–288. San Francisco, Calif.: Morgan Kaufmann.

McCarthy, J. F., and Lehnert, W. G. 1995. Using Decision Trees for Coreference Resolution. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, ed. C. Mellish, 1050–1055. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Magerman, D. M. 1995. Statistical Decision-Tree Models for Parsing. In Proceedings of the Thirty-Third Annual Meeting of the ACL, 276–283. Somerset, N.J.: Association for Computational Linguistics.

Marcus, M.; Marcinkiewicz, M.; and Santorini, B. 1993. Building a Large Annotated Corpus of English: The Penn Tree Bank. Computational Linguistics 19(2): 313–330.

MUC-3. 1991. Proceedings of the Third Message-Understanding Conference (MUC-3). San Francisco, Calif.: Morgan Kaufmann.

MUC-4. 1992. Proceedings of the Fourth Message-Understanding Conference (MUC-4). San Francisco, Calif.: Morgan Kaufmann.

MUC-5. 1994. Proceedings of the Fifth Message-Understanding Conference (MUC-5). San Francisco, Calif.: Morgan Kaufmann.

MUC-6. 1995. Proceedings of the Sixth Message-Understanding Conference (MUC-6). San Francisco, Calif.: Morgan Kaufmann.

Quinlan, J. R. 1992. C4.5: Programs for Machine Learning. San Francisco, Calif.: Morgan Kaufmann.

Ramshaw, L. A., and Marcus, M. P. 1995. Text Chunking Using Transformation-Based Learning. In Proceedings of the Thirty-Third Annual Meeting of the ACL, 82–94. Somerset, N.J.: Association for Computational Linguistics.

Riloff, E. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1044–1049. Menlo Park, Calif.: American Association for Artificial Intelligence.

Riloff, E. 1993. Automatically Constructing a Dictionary for Information-Extraction Tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 811–816. Menlo Park, Calif.: American Association for Artificial Intelligence.

Soderland, S. 1997. Learning to Extract Text-Based Information from the World Wide Web. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), 251–254. Menlo Park, Calif.: AAAI Press.

Soderland, S., and Lehnert, W. 1994. Corpus-Driven Knowledge Acquisition for Discourse Analysis. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 827–832. Menlo Park, Calif.: American Association for Artificial Intelligence.

Soderland, S.; Fisher, D.; Aseltine, J.; and Lehnert, W. 1995. CRYSTAL: Inducing a Conceptual Dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1314–1319. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Soderland, S.; Aronow, D.; Fisher, D.; Aseltine, J.; and Lehnert, W. 1995. Machine Learning of Text-Analysis Rules for Clinical Records, Technical Report TE-39, Department of Computer Science, University of Massachusetts.

Thompson, C. A.; Mooney, R. J.; and Tang, L. R. 1997. Learning to Parse Natural Language Database Queries into Logical Form. In Proceedings of the ML-97 Workshop on Automata Induction, Grammatical Inference, and Language Acquisition. Somerset, N.J.: Association for Computational Linguistics.

Weischedel, R.; Meteer, M.; Schwartz, R.; Ramshaw, L.; and Palmucci, J. 1993. Coping with Ambiguity and Unknown Words through Probabilistic Models. Computational Linguistics 19(2): 359–382.

Weischedel, R.; Ayuso, D.; Boisen, S.; Fox, H.; Matsukawa, T.; Papageorgiou, C.; MacLaughlin, D.; Sakai, T.; Abe, H. J. H.; Miyamoto, Y.; and Miller, S. 1993. BBN's PLUM Probabilistic Language-Understanding System. In Proceedings, TIPSTER Text Program (Phase I), 195–208. San Francisco, Calif.: Morgan Kaufmann.

Will, C. A. 1993. Comparing Human and Machine Performance for Natural Language Information Extraction: Results from the TIPSTER Text Evaluation. In Proceedings, TIPSTER Text Program (Phase I), 179–194. San Francisco, Calif.: Morgan Kaufmann.

Zelle, J., and Mooney, R. 1994. Inducing Deterministic Prolog Parsers from Tree Banks: A Machine-Learning Approach. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 748–753. Menlo Park, Calif.: American Association for Artificial Intelligence.

Claire Cardie is an assistant professor in the Computer Science Department at Cornell University. She received her Ph.D. in 1994 from the University of Massachusetts at Amherst. Her current research is in natural language learning, case-based learning, and the application of natural language–understanding techniques to problems in information retrieval. Her e-mail address is cardie@cs.cornell.edu.
