ommend relevant hyperlinks to the user. Laser uses reinforcement learning to tune the search parameters of a search engine [Boyan et al., 1996].

2.1 Experimental Results

In August 1998 we fully mapped the CS department web sites at Brown University, Cornell University, University of Pittsburgh and University of Texas. They include 53,012 documents and 592,216 hyperlinks. We perform four test/train splits, where the data from three universities is used to train a spider that is tested on the fourth. The target pages (for which a reward of 1 is given) are computer science research papers, identified separately by a simple hand-coded algorithm with high precision. We currently train the agent off-line. We find all target pages in the training data, and calculate the Q-value associated with each hyperlink as the discounted sum of rewards that result from executing the optimal policy (as determined by full knowledge of the web graph). The agent could also learn from experience on-line using TD(λ). A spider is evaluated on each test/train split by having it spider the test university, starting at the department's home page. In Figure 2 we report results from traditional Breadth-First search as well as two different reinforcement learners. Immediate uses γ = 0, and represents the Q-function as a binary classifier between immediate rewards and other hyperlinks. Future uses γ = 0.5, and represents the Q-function with a more finely-discriminating 3-bin classifier that uses future rewards.

[Figure 2 plot: curves for RL Immediate, RL Future and Breadth-First; x-axis: Percent Hyperlinks Followed.]
Figure 2: Average performance of two reinforcement learning spiders versus traditional breadth-first search.

At all times during their progress, both reinforcement learning spiders have found more research papers than breadth-first search. One measure of performance is the number of hyperlinks followed before 75% of the research papers are found. Both reinforcement learners are significantly more efficient, requiring exploration of less than 16% of the hyperlinks; in comparison, Breadth-First requires 48%. This represents a factor-of-three increase in spidering efficiency.

Note that the Future reinforcement learning spider performs better in the early stages of spidering, when hyperlinks must be chosen on the basis of future reward. On average the Immediate spider takes nearly three times as long as Future to find the first 28 (5%) of the papers. However, after the first 50% of the papers are found, the Immediate spider performs slightly better, because many links with immediate reward have been discovered, and the Immediate spider recognizes them more accurately. In ongoing work we are improving the accuracy of the classification when there is future reward and a larger number of bins. We have also run experiments on tasks with a single target page, where future reward decisions are more crucial. In this case, the Future spider retrieves target pages twice as efficiently as the Immediate spider [Rennie and McCallum, 1999].

3 Classification into a Hierarchy by Bootstrapping with Keywords

Topic hierarchies are an efficient way to organize and view large quantities of information that would otherwise be cumbersome. As Yahoo has shown, a topic hierarchy can be an integral part of a search engine. For Cora, we have created a 70-leaf hierarchy of computer science topics, the top part of which is shown in Figure 1. Creating the hierarchy structure and selecting just a few keywords associated with each node took about three hours, during which an expert examined conference proceedings and computer science Web sites.

A much more difficult and time-consuming part of creating a hierarchy is placing documents into the correct topic nodes. Yahoo has hired many people to categorize web pages into their hierarchy. In contrast, machine learning can automate this task with supervised text classification. However, acquiring enough labeled training documents to build an accurate classifier is often prohibitively expensive.

In this paper, we ease the burden on the classifier builder by using only unlabeled data, some keywords and the class hierarchy. Instead of asking the builder to hand-label training examples, the builder simply provides a few keywords for each category. A large collection of unlabeled documents is then preliminarily labeled by using the keywords as a rule-list classifier (searching a document for each keyword and placing it in the class of the first keyword found). These preliminary labels are noisy, and many documents remain unlabeled. However, we then bootstrap an improved classifier: using the documents and their preliminary labels, we initialize a naive Bayes text classifier. Then, Expectation-Maximization [Dempster et al., 1977] estimates labels of unlabeled documents and re-estimates labels of keyword-labeled documents. Statistical shrinkage is also incorporated in order to improve parameter estimates by using the class hierarchy. In this paper we combine for the first time in one document classifier both EM for unlabeled data and hierarchical shrinkage.
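The bootstrapping pipeline just described lends itself to a compact sketch. The following is a minimal illustration rather than the Cora implementation: the keyword lists and toy corpus are hypothetical, the class structure is flat (the real system uses a 70-leaf hierarchy with shrinkage), and EM runs for a fixed number of iterations instead of to convergence.

```python
import math
from collections import Counter

# Hypothetical keyword lists; in Cora each of the 70 leaves has its own few keywords.
KEYWORDS = {
    "reinforcement_learning": {"reinforcement", "reward", "policy"},
    "information_retrieval": {"retrieval", "query", "ranking"},
}
CLASSES = list(KEYWORDS)

def preliminary_label(doc):
    """Rule-list classifier: the class of the first keyword found, else None."""
    for word in doc:
        for cls, kws in KEYWORDS.items():
            if word in kws:
                return cls
    return None

def train(docs, label_probs, vocab):
    """M-step: Equation (2)-style smoothed estimates of P(w|c) and P(c),
    where label_probs[i][c] plays the role of a (possibly soft) P(c|d_i)."""
    word_counts = {c: Counter() for c in CLASSES}
    total_words = {c: 0.0 for c in CLASSES}
    doc_mass = {c: 0.0 for c in CLASSES}
    for doc, probs in zip(docs, label_probs):
        for c in CLASSES:
            doc_mass[c] += probs[c]
            for w in doc:
                word_counts[c][w] += probs[c]
                total_words[c] += probs[c]
    p_w = {c: {w: (1.0 + word_counts[c][w]) / (len(vocab) + total_words[c])
               for w in vocab} for c in CLASSES}
    p_c = {c: (1.0 + doc_mass[c]) / (len(CLASSES) + len(docs)) for c in CLASSES}
    return p_w, p_c

def posterior(doc, p_w, p_c):
    """E-step: Equation (1), P(c|d) proportional to P(c) * prod P(w|c), in log space."""
    logs = {c: math.log(p_c[c]) + sum(math.log(p_w[c][w]) for w in doc)
            for c in CLASSES}
    m = max(logs.values())
    unnorm = {c: math.exp(v - m) for c, v in logs.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def bootstrap(docs, n_iter=5):
    """Keyword-label the corpus, initialize naive Bayes, then iterate EM."""
    vocab = {w for d in docs for w in d}
    labels = [preliminary_label(d) for d in docs]
    # Keyword-labeled docs get hard initial labels; unlabeled docs get zero
    # weight initially and enter the model through the EM iterations.
    probs = [{c: 1.0 if c == lab else 0.0 for c in CLASSES} for lab in labels]
    p_w, p_c = train(docs, probs, vocab)
    for _ in range(n_iter):
        probs = [posterior(d, p_w, p_c) for d in docs]   # E-step (Equation 1)
        p_w, p_c = train(docs, probs, vocab)             # M-step (Equation 2)
    return p_w, p_c

# Tiny hypothetical corpus; the last two documents match no keyword
# and are only labeled once EM begins.
DOCS = [
    ["reward", "policy", "agent"],
    ["query", "ranking", "web"],
    ["agent", "policy", "exploration"],
    ["web", "ranking", "index"],
    ["agent", "exploration"],
    ["web", "index"],
]

p_w, p_c = bootstrap(DOCS)
```

As in the procedure above, the noisy keyword labels only seed the model: EM subsequently re-estimates them along with the labels of the initially unlabeled documents.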
3.1 Bayesian Text Classification

We use the framework of multinomial naive Bayes text classification. The parameters of this model are, for each class c_j, the frequency with which each word w_t occurs, P(w_t|c_j), and the relative document frequency of each class, P(c_j). Given estimates of these parameters and a document d_i, we can determine the probability that it belongs in class c_j by Bayes' rule:

    P(c_j | d_i) \propto P(c_j) \prod_{k=1}^{|d_i|} P(w_{d_{ik}} | c_j),    (1)

where w_{d_{ik}} is the word w_t that occurs in the kth position of document d_i. Training a standard naive Bayes classifier requires a set of documents, D, and their class labels. The estimate of a word frequency is simply the smoothed frequency with which the word occurs in training documents from the class:

    P(w_t | c_j) = \frac{1 + \sum_{d_i \in D} N(w_t, d_i) P(c_j | d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in D} N(w_s, d_i) P(c_j | d_i)},    (2)

where N(w_t, d_i) is the number of times word w_t occurs in document d_i, P(c_j|d_i) is an indicator of whether document d_i belongs in class c_j, and |V| is the number of words in the vocabulary. Similarly, the class frequencies P(c_j) are smoothed document frequencies estimated from the data.

When a combination of labeled and unlabeled data is available, past work has shown that naive Bayes parameter estimates can be improved by using EM to combine evidence from all the data [Nigam et al., 1999]. In our bootstrapping approach, an initial naive Bayes model is estimated from the preliminarily-labeled data. Then EM iterates until convergence, (1) labeling all the data (Equation 1) and (2) rebuilding a model with all the data (Equation 2). The preliminary labels serve to provide a good starting point; EM then incorporates the unlabeled data and re-estimates the preliminary labels.

When the classes are organized hierarchically, as is our case, naive Bayes parameter estimates can be improved with the statistical technique shrinkage [McCallum et al., 1998]. New word frequency parameter estimates for a class are calculated by a weighted average between the class's local estimates, and estimates of its ancestors in the hierarchy (each formed by pooling data from all the ancestor's children). The technique balances a trade-off between the specificity of the unreliable local word frequency estimates and the reliability of the more general ancestor's frequency estimates. The optimal mixture weights for the weighted average are calculated by EM concurrently with the class labels.

3.2 Experimental Results

Now we describe results of classifying computer science research papers into our 70-leaf hierarchy. A test set was created by hand-labeling a random sample of 625 research papers from the 30,682 papers formerly comprising the entire Cora archive. Of these, 225 did not fit into any category, and were discarded. In these experiments, we used the title, author, institution, references, and abstracts of papers for classification, not the full text.

Traditional naive Bayes with 400 labeled training documents, tested in a leave-one-out fashion, results in 47% classification accuracy. However, fewer than 100 documents could have been hand-labeled in the 90 minutes it took to create the keyword-lists; using this smaller training set results in only 30% accuracy. The rule-list classifier based on the keywords alone provides 45%.

We now turn to our bootstrap approach. When these noisy keyword-labels are used to train a traditional naive Bayes text classifier, 47% accuracy is reached on the test set. The full algorithm, including EM and hierarchical shrinkage, achieves 66% accuracy. As an interesting comparison, human agreement between two people on the test set was 72%.

These results demonstrate the utility of the bootstrapping approach. Keyword matching alone is noisy, but when naive Bayes, EM and hierarchical shrinkage are used together as a regularizer, the resulting classification accuracy is close to human agreement levels. Automatically creating preliminary labels, either from keywords or other sources, avoids the significant human effort of hand-labeling training data.

In future work we plan to refine our probabilistic model to allow for documents to be placed in interior hierarchy nodes, documents to have multiple class assignments, and multiple mixture components per class. We are also investigating principled methods of re-weighting the word features for "semi-supervised" clustering that will provide better discriminative training with unlabeled data.

4 Information Extraction

Information extraction is concerned with identifying phrases of interest in textual data. In the case of a search engine over research papers, the automatic extraction of informative text segments can be used to (1) allow searches over specific fields, (2) provide useful, effective presentation of search results (e.g. showing the title in bold), and (3) match references to papers. We have investigated techniques for extracting the fields relevant to research papers, such as title, author, and journal, from both the headers and reference sections of papers.

Our information extraction approach is based on hidden Markov models (HMMs) and their accompanying search techniques that are widely used for speech recognition and part-of-speech tagging [Rabiner, 1989]. Discrete-output, first-order HMMs are composed of a set of states Q, which emit symbols from a discrete vocabulary Σ, and a set of transitions between states (q → q′). A common goal of learning problems that use HMMs is to recover the state sequence V(x|M) that has the highest probability of having produced some observation sequence x = x_1 x_2 … x_l ∈ Σ*:
    V(x|M) = \arg\max_{q_1 \ldots q_l \in Q^l} \prod_{k=1}^{l} P(q_{k-1} \to q_k) \, P(q_k \uparrow x_k),    (3)

where M is the model, P(q_{k-1} → q_k) is the probability of transitioning between states q_{k-1} and q_k, and P(q_k ↑ x_k) is the probability of state q_k emitting output symbol x_k. The Viterbi algorithm [Viterbi, 1967] can be used to efficiently recover this state sequence.

HMMs may be used for extracting information from research papers by formulating a model in the following way: each state is associated with a class that we want to extract, such as title, author, or affiliation. Each state emits words from a class-specific unigram distribution. In order to label new text with classes, we treat the words from the new text as observations and recover the most likely state sequence. The state that produces each word is the class tag for that word. An illustrative example of an HMM for reference extraction is shown in Figure 3.

[Figure 3 diagram: states such as title, author, editor, date, institution, tech, booktitle, journal and volume, connected by arcs labeled with transition probabilities.]
Figure 3: Illustrative example of an HMM for reference extraction.

Our work with HMMs for information extraction focuses on learning the appropriate model structure (the number of states and transitions) automatically from data. Other systems using HMMs for information extraction include that by Leek [1997], which extracts information about gene names and locations from scientific abstracts, and the Nymble system [Bikel et al., 1997] for named-entity extraction. These systems do not consider automatically determining model structure from data; they either use one state per class, or use hand-built models assembled by inspecting training examples.

4.1 Experiments

The goal of our information extraction experiments is to investigate whether a model with multiple states per class, either manually or automatically derived, outperforms a model with only one state per class for header extraction. We define the header of a research paper to be all of the words from the beginning of the paper up to either the first section of the paper, usually the introduction, or to the end of the first page, whichever occurs first. A single token, either +INTRO+ or +PAGE+, is added to the end of each header to indicate the case with which it terminated. Likewise, the abstract is automatically located and substituted with the single token +ABSTRACT+. A few special classes of words are identified using simple regular expressions and converted to tokens such as +EMAIL+. All punctuation, case and newline information is removed from the text. The target classes we wish to identify include the following fifteen categories: title, author, affiliation, address, note, email, date, abstract, introduction, phone, keywords, web, degree, publication number, and page.

Manually tagged headers are split into a 500-header, 23,557-word-token labeled training set and a 435-header, 20,308-word-token test set. Unigram language models are built for each class and smoothed using a modified form of absolute discounting. Each state uses its class unigram distribution as its emission distribution.

We compare the performance of a model with one state per class (Baseline) to that of models with multiple states per class (M-merged, V-merged). The multi-state models are derived from training data in the following way: a maximally-specific HMM is built where each word token in the training set is assigned a single state that only transitions to the state that follows it. Each state is associated with the class label of its word token. Then, the HMM is put through a series of state merges in order to generalize the model. First, "neighbor merging" combines all states that share a unique transition and have the same class label. For example, all adjacent title states are merged into one title state. As two states are merged, transition counts are preserved, introducing a self-loop on the new merged state. The neighbor-merged model is used as the starting point for the two multi-state models. Manual merge decisions are made in an iterative manner to produce the M-merged model, and an automatic forward and backward V-merging procedure is used to produce the V-merged model. V-merging consists of merging any two states that share transitions from or to a common state and have the same label. Transition probabilities for the three models are set to their maximum likelihood estimates; the Baseline model takes its transition counts directly from the labeled training data, whereas the multi-state models use the counts that have been preserved during the state merging process.

Model performance is measured by word classification accuracy, which is the percentage of header words that are emitted by a state with the same label as the words' true label. Extraction results are presented in Table 1. Hidden Markov models do well at extracting header information; the best performance of 91.1% is obtained with the M-merged model. Both of the multi-state models outperform the Baseline model, indicating that the richer representation available through models derived from data is beneficial. However, the automatically-derived V-merged model does not perform as well as the manually-derived M-merged model. The V-merged model is limited in the state merges it can perform, whereas the M-merged model is unrestricted. We expect that more sophisticated state merging techniques, as discussed in [Seymore et al., 1999], will result in superior-performing models for information extraction.

              Number of   Number of
  Model        states     transitions   Accuracy
  Baseline        17          149         89.8
  M-merged        36          164         91.1
  V-merged       155          402         90.2

Table 1: Extraction accuracy (%) for the Baseline, M-merged and V-merged models.

5 Related Work

Several related research projects are investigating the automatic construction of special-purpose web sites. The New Zealand Digital Library project [Witten et al., 1998] has created publicly-available search engines for domains from computer science technical reports to song melodies using manually identified web sources. The CiteSeer project [Bollacker et al., 1998] has also developed a search engine for computer science research papers that provides similar functionality for matching references and searching. The WebKB project [Craven et al., 1998] uses machine learning techniques to extract domain-specific information available on the Web into a knowledge base. The WHIRL project [Cohen, 1998] is an effort to integrate a variety of topic-specific sources into a single domain-specific search engine using HTML-based extraction patterns and fuzzy matching for information retrieval searching.

6 Conclusions

The amount of information available on the Internet continues to grow exponentially. As this trend continues, we argue that, not only will the public need powerful tools to help them sort through this information, but the creators of these tools will need intelligent techniques to help them build and maintain these tools. This paper has shown that machine learning techniques can significantly aid the creation and maintenance of domain-specific search engines. We have presented new research in reinforcement learning, text classification and information extraction towards this end. In future work, we will apply machine learning to automate more aspects of domain-specific search engines, such as creating a topic hierarchy with clustering and automatically identifying seminal papers with citation graph analysis. We will also verify that the techniques in this paper generalize by applying them to a new domain.

References

[Bikel et al., 1997] D. Bikel, S. Miller, R. Schwartz, and R. Weischedel. Nymble: a high-performance learning name-finder. In ANLP-97, 1997.

[Bollacker et al., 1998] K. Bollacker, S. Lawrence, and C. L. Giles. CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications. In Agents '98, 1998.

[Boyan et al., 1996] J. Boyan, D. Freitag, and T. Joachims. A machine learning architecture for optimizing web search engines. In AAAI Workshop on Internet-Based Information Systems, 1996.

[Cohen, 1998] W. Cohen. A web-based information system that reasons with structured collections of text. In Agents '98, 1998.

[Craven et al., 1998] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. In AAAI-98, 1998.

[Dempster et al., 1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[Joachims et al., 1997] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: A tour guide for the World Wide Web. In IJCAI-97, 1997.

[Kaelbling et al., 1996] L. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[Leek, 1997] T. Leek. Information extraction using hidden Markov models. Master's thesis, UCSD, 1997.

[McCallum et al., 1998] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng. Improving text classification by shrinkage in a hierarchy of classes. In ICML-98, 1998.

[McCallum et al., 1999] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Building domain-specific search engines with machine learning techniques. In AAAI Spring Symposium on Intelligent Agents in Cyberspace, 1999.

[Menczer, 1997] F. Menczer. ARACHNID: Adaptive retrieval agents choosing heuristic neighborhoods for information discovery. In ICML-97, 1997.

[Nigam et al., 1999] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 1999. To appear.

[Rabiner, 1989] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[Rennie and McCallum, 1999] J. Rennie and A. McCallum. Using reinforcement learning to spider the Web efficiently. In ICML-99, 1999.

[Seymore et al., 1999] K. Seymore, A. McCallum, and R. Rosenfeld. Learning hidden Markov model structure for information extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999.

[Viterbi, 1967] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13:260–269, 1967.

[Witten et al., 1998] I. Witten, C. Nevill-Manning, R. McNab, and S. J. Cunningham. A public digital library based on full-text retrieval: Collections and experience. Communications of the ACM, 41(4):71–75, 1998.