Machine Learning in Automated Text Categorization

FABRIZIO SEBASTIANI
Consiglio Nazionale delle Ricerche, Italy
The automated categorization (or classification) of texts into predefined categories has
witnessed a booming interest in the last 10 years, due to the increased availability of
documents in digital form and the ensuing need to organize them. In the research
community the dominant approach to this problem is based on machine learning
techniques: a general inductive process automatically builds a classifier by learning,
from a set of preclassified documents, the characteristics of the categories. The
advantages of this approach over the knowledge engineering approach (consisting in
the manual definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert labor power, and straightforward portability to
different domains. This survey discusses the main approaches to text categorization
that fall within the machine learning paradigm. We will discuss in detail issues
pertaining to three different problems, namely, document representation, classifier
construction, and classifier evaluation.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]:
Content Analysis and Indexing—Indexing methods; H.3.3 [Information Storage and
Retrieval]: Information Search and Retrieval—Information filtering; H.3.4
[Information Storage and Retrieval]: Systems and Software—Performance
evaluation (efficiency and effectiveness); I.2.6 [Artificial Intelligence]: Learning—
Induction
General Terms: Algorithms, Experimentation, Theory
Additional Key Words and Phrases: Machine learning, text categorization, text
classification
1. INTRODUCTION
In the last 10 years content-based doc-
ument management tasks (collectively
known as information retrieval—IR) have
gained a prominent status in the informa-
tion systems field, due to the increased
availability of documents in digital form
and the ensuing need to access them in
flexible ways. Text categorization (TC—
a.k.a. text classification, or topic spotting),
the activity of labeling natural language
texts with thematic categories from a pre-
defined set, is one such task. TC dates
back to the early ’60s, but only in the early
’90s did it become a major subfield of the
information systems discipline, thanks to
increased applicative interest and to the
availability of more powerful hardware.
TC is now being applied in many contexts,
ranging from document indexing based
on a controlled vocabulary, to document
filtering, automated metadata generation,
word sense disambiguation, population of
hierarchical catalogues of Web resources,
and in general any application requiring
document organization or selective and
adaptive document dispatching.
Until the late ’80s the most popular ap-
proach to TC, at least in the “operational”
(i.e., real-world applications) community,
was a knowledge engineering (KE) one,
consisting in manually defining a set of
rules encoding expert knowledge on how
to classify documents under the given cat-
egories. In the ’90s this approach has in-
creasingly lost popularity (especially in
the research community) in favor of the
machine learning (ML) paradigm, accord-
ing to which a general inductive process
automatically builds an automatic text
classifier by learning, from a set of preclas-
sified documents, the characteristics of the
categories of interest. The advantages of
this approach are an accuracy comparable
to that achieved by human experts, and
a considerable savings in terms of expert
labor power, since no intervention from ei-
ther knowledge engineers or domain ex-
perts is needed for the construction of the
classifier or for its porting to a different set
of categories. It is the ML approach to TC
that this paper concentrates on.
Current-day TC is thus a discipline at
the crossroads of ML and IR, and as
such it shares a number of characteris-
tics with other tasks such as information/
knowledge extraction from texts and text
mining [Knight 1999; Pazienza 1997].
There is still considerable debate on where
the exact border between these disciplines
lies, and the terminology is still evolving.
“Text mining” is increasingly being used
to denote all the tasks that, by analyz-
ing large quantities of text and detect-
ing usage patterns, try to extract probably
useful (although only probably correct)
information. According to this view, TC is
an instance of text mining. TC enjoys quite a rich literature now, but this is still fairly scattered.¹ Although two international journals have devoted special issues to this topic [Joachims and Sebastiani 2002; Lewis and Hayes 1994], there are no systematic treatments of the subject: there are neither textbooks nor journals entirely devoted to TC yet, and Manning and Schütze [1999, Chapter 16] is the only chapter-length treatment of the subject.

¹ A fully searchable bibliography on TC created and maintained by this author is available at http://liinwww.ira.uka.de/bibliography/Ai/automated.text.categorization.html.
As a note, we should warn the reader
that the term “automatic text classifica-
tion” has sometimes been used in the liter-
ature to mean things quite different from
the ones discussed here. Aside from (i) the
automatic assignment of documents to a
predefined set of categories, which is the
main topic of this paper, the term has also
been used to mean (ii) the automatic iden-
tification of such a set of categories (e.g.,
Borko and Bernick [1963]), or (iii) the au-
tomatic identification of such a set of cat-
egories and the grouping of documents
under them (e.g., Merkl [1998]), a task
usually called text clustering, or (iv) any
activity of placing text items into groups,
a task that has thus both TC and text clustering as particular instances [Manning and Schütze 1999].
This paper is organized as follows. In
Section 2 we formally define TC and its
various subcases, and in Section 3 we
review its most important applications.
Section 4 describes the main ideas under-
lying the ML approach to classification.
Our discussion of text classification starts
in Section 5 by introducing text index-
ing, that is, the transformation of textual
documents into a form that can be inter-
preted by a classifier-building algorithm
and by the classifier eventually built by it.
Section 6 tackles the inductive construc-
tion of a text classifier from a “training”
set of preclassified documents. Section 7
discusses the evaluation of text classi-
fiers. Section 8 concludes, discussing open
issues and possible avenues of further
research for TC.
2. TEXT CATEGORIZATION
2.1. A Definition of Text Categorization
Text categorization is the task of assigning a Boolean value to each pair ⟨d_j, c_i⟩ ∈ D × C, where D is a domain of documents and C = {c_1, ..., c_|C|} is a set of predefined categories. A value of T assigned to ⟨d_j, c_i⟩ indicates a decision to file d_j under c_i, while a value of F indicates a decision not to file d_j under c_i. More formally, the task is to approximate the unknown target function Φ̆ : D × C → {T, F} (that describes how documents ought to be classified) by means of a function Φ : D × C → {T, F} called the classifier (a.k.a. rule, or hypothesis, or model) such that Φ̆ and Φ "coincide as much as possible." How to precisely define and measure this coincidence (called effectiveness) will be discussed in Section 7.1. From now on we will assume that:
—The categories are just symbolic la-
bels, and no additional knowledge (of
a procedural or declarative nature) of
their meaning is available.
—No exogenous knowledge (i.e., data pro-
vided for classification purposes by an
external source) is available; therefore,
classification must be accomplished on
the basis of endogenous knowledge only
(i.e., knowledge extracted from the doc-
uments). In particular, this means that
metadata such as, for example, pub-
lication date, document type, publica-
tion source, etc., is not assumed to be
available.
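To make the formal setting concrete, here is a minimal Python sketch of the notation just introduced; the documents, the categories, and the keyword-based decision logic are invented for illustration only (a real Φ would be learned from data, as discussed in Section 4):

    from typing import Callable

    # D: a toy domain of documents; C: a set of predefined categories.
    documents = ["Clinton attends Dizzy Gillespie's funeral",
                 "Parliament votes on the budget"]
    categories = ["Politics", "Jazz"]

    # A classifier Phi : D x C -> {T, F}, encoded as a Boolean-valued function.
    Classifier = Callable[[str, str], bool]

    def phi(d_j: str, c_i: str) -> bool:
        # Toy stand-in: file d_j under c_i whenever a hand-picked key word
        # for c_i occurs in it. Note that the first document is filed under
        # both categories, as in the Clinton/Jazz example above.
        keywords = {"Politics": "Clinton", "Jazz": "Gillespie"}
        return keywords[c_i] in d_j

    for d in documents:
        for c in categories:
            print(f"({d!r}, {c!r}) -> {'T' if phi(d, c) else 'F'}")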
The TC methods we will discuss are
thus completely general, and do not de-
pend on the availability of special-purpose
resources that might be unavailable or
costly to develop. Of course, these as-
sumptions need not be verified in opera-
tional settings, where it is legitimate to
use any source of information that might
be available or deemed worth developing
[Díaz Esteban et al. 1998; Junker and
Abecker 1997]. Relying only on endoge-
nous knowledge means classifying a docu-
ment based solely on its semantics, and
given that the semantics of a document
is a subjective notion, it follows that the
membership of a document in a cate-
gory (pretty much as the relevance of a
document to an information need in IR
[Saracevic 1975]) cannot be decided de-
terministically. This is exemplified by the
phenomenon of inter-indexer inconsistency
[Cleverdon 1984]: when two human ex-
perts decide whether to classify document
d_j under category c_i, they may disagree,
and this in fact happens with relatively
high frequency. A news article on Clinton
attending Dizzy Gillespie’s funeral could
be filed under Politics, or under Jazz, or un-
der both, or even under neither, depending
on the subjective judgment of the expert.
2.2. Single-Label Versus Multilabel
Text Categorization
Different constraints may be enforced on
the TC task, depending on the applica-
tion. For instance we might need that, for a given integer k, exactly k (or ≤ k, or ≥ k) elements of C be assigned to each d_j ∈ D. The case in which exactly one category must be assigned to each d_j ∈ D is often called the single-label (a.k.a. nonoverlapping categories) case, while the case in which any number of categories from 0 to |C| may be assigned to the same d_j ∈ D is dubbed the multilabel (a.k.a. overlapping categories) case. A special case of single-label TC is binary TC, in which each d_j ∈ D must be assigned either to category c_i or to its complement c̄_i.
From a theoretical point of view, the binary case (hence, the single-label case, too) is more general than the multilabel, since an algorithm for binary classification can also be used for multilabel classification: one needs only transform the problem of multilabel classification under {c_1, ..., c_|C|} into |C| independent problems of binary classification under {c_i, c̄_i}, for i = 1, ..., |C|. However, this requires that categories be stochastically independent of each other, that is, for any c′, c″, the value of Φ̆(d_j, c′) does not depend on the value of Φ̆(d_j, c″) and vice versa; this is usually assumed to be the case (applications in which this is not the case are discussed in Section 3.5). The converse is not true: an algorithm for multilabel classification cannot be used for either binary or single-label classification. In fact, given a document d_j to classify, (i) the classifier might attribute k > 1 categories to d_j, and it might not be obvious how to choose a "most appropriate" category from them; or (ii) the classifier might attribute to d_j no category at all, and it might not be obvious how to choose a "least inappropriate" category from C.
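A minimal sketch of this reduction in Python, assuming (as above) that categories are stochastically independent; train_binary stands for any binary learner and is an assumption of the example:

    def train_multilabel(train_binary, corpus, categories):
        # Reduce multilabel TC under {c_1, ..., c_|C|} to |C| independent
        # binary problems, one per category c_i versus its complement.
        # `corpus` maps each training document to its set of true categories;
        # `train_binary(positives, negatives)` is assumed to return a binary
        # classifier, i.e., a function d -> bool.
        classifiers = {}
        for c in categories:
            positives = [d for d, cats in corpus.items() if c in cats]
            negatives = [d for d, cats in corpus.items() if c not in cats]
            classifiers[c] = train_binary(positives, negatives)
        return classifiers

    def classify_multilabel(classifiers, d):
        # A document receives every category whose binary classifier accepts
        # it; the resulting set may have any size from 0 to |C|.
        return {c for c, phi_i in classifiers.items() if phi_i(d)}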
In the rest of the paper, unless explicitly
mentioned, we will deal with the binary
case. There are various reasons for this:
—The binary case is important in itself because important TC applications, including filtering (see Section 3.3), consist of binary classification problems (e.g., deciding whether d_j is about Jazz or not). In TC, most binary classification problems feature unevenly populated categories (e.g., far fewer documents are about Jazz than are not) and unevenly characterized categories (e.g., what is about Jazz can be characterized much better than what is not).
—Solving the binary case also means solv-
ing the multilabel case, which is also
representative of important TC applica-
tions, including automated indexing for
Boolean systems (see Section 3.1).
—Most of the TC literature is couched in
terms of the binary case.
—Most techniques for binary classification are just special cases of existing techniques for the single-label case, and are simpler to illustrate than the latter.
This ultimately means that we will view classification under C = {c_1, ..., c_|C|} as consisting of |C| independent problems of classifying the documents in D under a given category c_i, for i = 1, ..., |C|. A classifier for c_i is then a function Φ_i : D → {T, F} that approximates an unknown target function Φ̆_i : D → {T, F}.
2.3. Category-Pivoted Versus
Document-Pivoted Text Categorization
There are two different ways of using a text classifier. Given d_j ∈ D, we might want to find all the c_i ∈ C under which it should be filed (document-pivoted categorization—DPC); alternatively, given c_i ∈ C, we might want to find all the d_j ∈ D that should be filed under it (category-pivoted categorization—CPC). This distinction is more pragmatic than conceptual, but is important since the sets C and D might not be available in their entirety right from the start. It is also relevant to the choice of the classifier-building method, as some of these methods (see Section 6.9) allow the construction of classifiers with a definite slant toward one or the other style.

DPC is thus suitable when documents become available at different moments in time, e.g., in filtering e-mail. CPC is instead suitable when (i) a new category c_{|C|+1} may be added to an existing set C = {c_1, ..., c_|C|} after a number of documents have already been classified under C, and (ii) these documents need to be reconsidered for classification under c_{|C|+1} (e.g., Larkey [1999]). DPC is used more often than CPC, as the former situation is more common than the latter.
Although some specific techniques ap-
ply to one style and not to the other (e.g.,
the proportional thresholding method dis-
cussed in Section 6.1 applies only to CPC),
this is more the exception than the rule:
most of the techniques we will discuss al-
low the construction of classifiers capable
of working in either mode.
2.4. “Hard” Categorization Versus
Ranking Categorization
While a complete automation of the TC task requires a T or F decision for each pair ⟨d_j, c_i⟩, a partial automation of this process might have different requirements.

For instance, given d_j ∈ D a system might simply rank the categories in C = {c_1, ..., c_|C|} according to their estimated appropriateness to d_j, without taking any "hard" decision on any of them. Such a ranked list would be of great help to a human expert in charge of taking the final categorization decision, since she could thus restrict the choice to the category (or categories) at the top of the list, rather than having to examine the entire set. Alternatively, given c_i ∈ C a system might simply rank the documents in D according to their estimated appropriateness to c_i; symmetrically, for classification under c_i a human expert would just examine the top-ranked documents instead of the entire document set. These two modalities are sometimes called category-ranking TC and document-ranking TC [Yang 1999], respectively, and are the obvious counterparts of DPC and CPC.
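Both modalities amount to sorting by an estimated appropriateness score; a minimal sketch, where score(d, c) is an assumed input (e.g., a probability output by one of the classifiers of Section 6):

    def rank_categories(score, d_j, categories):
        # Category-ranking TC: sort the categories from most to least
        # appropriate for d_j, taking no hard T/F decision.
        return sorted(categories, key=lambda c: score(d_j, c), reverse=True)

    def rank_documents(score, c_i, documents):
        # Document-ranking TC: the symmetric modality, ranking the documents
        # by their estimated appropriateness to a fixed category c_i.
        return sorted(documents, key=lambda d: score(d, c_i), reverse=True)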
Semiautomated, “interactive” classifica-
tion systems [Larkey and Croft 1996] are
useful especially in critical applications
in which the effectiveness of a fully au-
tomated system may be expected to be
significantly lower than that of a human
expert. This may be the case when the
quality of the training data (see Section 4)
is low, or when the training documents
cannot be trusted to be a representative
sample of the unseen documents that are
to come, so that the results of a completely
automatic classifier could not be trusted
completely.
In the rest of the paper, unless explicitly
mentioned, we will deal with “hard” classi-
fication; however, many of the algorithms
we will discuss naturally lend themselves
to ranking TC too (more details on this in
Section 6.1).
3. APPLICATIONS OF TEXT
CATEGORIZATION
TC goes back to Maron’s [1961] semi-
nal work on probabilistic text classifica-
tion. Since then, it has been used for a
number of different applications, of which
we here briefly review the most impor-
tant ones. Note that the borders between the different classes of applications listed here are fuzzy and somewhat artificial, and some of these may be considered special
cases of others. Other applications we do
not explicitly discuss are speech catego-
rization by means of a combination of
speech recognition and TC [Myers et al.
2000; Schapire and Singer 2000], multi-
media document categorization through
the analysis of textual captions [Sable
and Hatzivassiloglou 2000], author iden-
tification for literary texts of unknown or
disputed authorship [Forsyth 1999], lan-
guage identification for texts of unknown
language [Cavnar and Trenkle 1994],
automated identification of text genre
[Kessler et al. 1997], and automated essay
grading [Larkey 1998].
3.1. Automatic Indexing for Boolean
Information Retrieval Systems
The application that has spawned most
of the early research in the field [Borko
and Bernick 1963; Field 1975; Gray and
Harley 1971; Heaps 1973; Maron 1961]
is that of automatic document indexing
for IR systems relying on a controlled
dictionary, the most prominent example
of which is Boolean systems. In the latter, each document is assigned one or
more key words or key phrases describ-
ing its content, where these key words and
key phrases belong to a finite set called
controlled dictionary, often consisting of
a thematic hierarchical thesaurus (e.g.,
the NASA thesaurus for the aerospace
discipline, or the MESH thesaurus for
medicine). Usually, this assignment is
done by trained human indexers, and is
thus a costly activity.
If the entries in the controlled vocab-
ulary are viewed as categories, text in-
dexing is an instance of TC, and may
thus be addressed by the automatic tech-
niques described in this paper. Recalling Section 2.2, note that this application may typically require that k_1 ≤ x ≤ k_2 key words be assigned to each document, for given k_1, k_2. Document-pivoted TC is probably the best option, so that new documents may be classified as they become available. Various text classifiers explicitly conceived for document indexing have been described in the literature; see, for example, Fuhr and Knorz [1984], Robertson and Harding [1984], and Tzeras and Hartmann [1993].
Automatic indexing with controlled dic-
tionaries is closely related to automated
metadata generation. In digital libraries,
one is usually interested in tagging doc-
uments by metadata that describes them
under a variety of aspects (e.g., creation
date, document type or format, availabil-
ity, etc.). Some of this metadata is the-
matic, that is, its role is to describe the
semantics of the document by means of
bibliographic codes, key words or key
phrases. The generation of this metadata
may thus be viewed as a problem of doc-
ument indexing with controlled dictio-
nary, and thus tackled by means of TC
techniques.
3.2. Document Organization
Indexing with a controlled vocabulary is an instance of the general problem of document base organization. In general, many other issues pertaining to document organization and filing, be it for purposes of personal organization or structuring of a corporate document base, may be addressed by TC techniques. For instance, at the offices of a newspaper, incoming "classified" ads must be, prior to publication, categorized under categories such as Personals, Cars for Sale, Real Estate, etc. Newspapers dealing with a high volume of classified ads would benefit from an automatic system that chooses the most suitable category for a given ad. Other
suitable category for a given ad. Other
possible applications are the organiza-
tion of patents into categories for mak-
ing their search easier [Larkey 1999], the
automatic filing of newspaper articles un-
der the appropriate sections (e.g., Politics,
Home News, Lifestyles, etc.), or the auto-
matic grouping of conference papers into
sessions.
3.3. Text Filtering
Text filtering is the activity of classify-
ing a stream of incoming documents dis-
patched in an asynchronous way by an
information producer to an information
consumer [Belkin and Croft 1992]. A typ-
ical case is a newsfeed, where the pro-
ducer is a news agency and the consumer
is a newspaper [Hayes et al. 1990]. In
this case, the filtering system should block
the delivery of the documents the con-
sumer is likely not interested in (e.g., all
news not concerning sports, in the case
of a sports newspaper). Filtering can be
seen as a case of single-label TC, that
is, the classification of incoming docu-
ments into two disjoint categories, the
relevant and the irrelevant. Additionally,
a filtering system may also further clas-
sify the documents deemed relevant to
the consumer into thematic categories;
in the example above, all articles about
sports should be further classified accord-
ing to which sport they deal with, so as
to allow journalists specialized in indi-
vidual sports to access only documents of
prospective interest for them. Similarly,
an e-mail filter might be trained to discard
“junk” mail [Androutsopoulos et al. 2000;
Drucker et al. 1999] and further classify
nonjunk mail into topical categories of in-
terest to the user.
A filtering system may be installed at
the producer end, in which case it must
route the documents to the interested con-
sumers only, or at the consumer end, in
which case it must block the delivery of
documents deemed uninteresting to the
consumer. In the former case, the system
builds and updates a “profile” for each con-
sumer [Liddy et al. 1994], while in the lat-
ter case (which is the more common, and
to which we will refer in the rest of this
section) a single profile is needed.
A profile may be initially specified by
the user, thereby resembling a standing
IR query, and is updated by the system
by using feedback information provided
(either implicitly or explicitly) by the user
on the relevance or nonrelevance of the delivered messages. In the TREC community [Lewis 1995c], this is called adaptive filtering, while the case in which no user-specified profile is available is called either routing or batch filtering, depending on whether documents have to be ranked in decreasing order of estimated relevance or just accepted/rejected. Batch filtering thus coincides with single-label TC under |C| = 2 categories; since this latter is a completely general TC task, some authors [Hull 1994; Hull et al. 1996; Schapire et al. 1998; Schütze et al. 1995], somewhat confusingly, use the term "filtering" in place of the more appropriate term "categorization."
In information science, document filter-
ing has a tradition dating back to the
’60s, when, addressed by systems of var-
ious degrees of automation and dealing
with the multiconsumer case discussed
above, it was called selective dissemina-
tion of information or current awareness
(see Korfhage [1997, Chapter 6]). The ex-
plosion in the availability of digital infor-
mation has boosted the importance of such
systems, which are nowadays being used
in contexts such as the creation of person-
alized Web newspapers, junk e-mail block-
ing, and Usenet news selection.
Information filtering by ML techniques
is widely discussed in the literature: see
Amati and Crestani [1999], Iyer et al.
[2000], Kim et al. [2000], Tauritz et al.
[2000], and Yu and Lam [1998].
3.4. Word Sense Disambiguation
Word sense disambiguation (WSD) is the
activity of finding, given the occurrence in
a text of an ambiguous (i.e., polysemous
or homonymous) word, the sense of this
particular word occurrence. For instance,
bank may have (at least) two different
senses in English, as in the Bank of
England (a financial institution) or the
bank of river Thames (a hydraulic engi-
neering artifact). It is thus a WSD task
to decide which of the above senses the oc-
currence of bank in Last week I borrowed
some money from the bank has. WSD is
very important for many applications, in-
cluding natural language processing, and
indexing documents by word senses rather
than by words for IR purposes. WSD may
be seen as a TC task (see Gale et al.
[1993]; Escudero et al. [2000]) once we
view word occurrence contexts as doc-
uments and word senses as categories.
Quite obviously, this is a single-label TC
case, and one in which document-pivoted
TC is usually the right choice.
WSD is just an example of the more gen-
eral issue of resolving natural language
ambiguities, one of the most important
problems in computational linguistics.
Other examples, which may all be tackled
by means of TC techniques along the lines
discussed for WSD, are context-sensitive
spelling correction, prepositional phrase
attachment, part of speech tagging, and
word choice selection in machine transla-
tion; see Roth [1998] for an introduction.
3.5. Hierarchical Categorization
of Web Pages
TC has recently aroused a lot of interest
also for its possible application to auto-
matically classifying Web pages, or sites,
under the hierarchical catalogues hosted
by popular Internet portals. When Web
documents are catalogued in this way,
rather than issuing a query to a general-
purpose Web search engine a searcher
may find it easier to first navigate in
the hierarchy of categories and then re-
strict her search to a particular category
of interest.
Classifying Web pages automatically
has obvious advantages, since the man-
ual categorization of a large enough sub-
set of the Web is infeasible. Unlike in the previous applications, it is typically the case that each category must be populated by a set of k_1 ≤ x ≤ k_2 documents. CPC should be chosen so as to allow new categories to be added and obsolete ones to be deleted.
With respect to previously discussed TC
applications, automatic Web page catego-
rization has two essential peculiarities:
(1) The hypertextual nature of the doc-
uments: Links are a rich source of
information, as they may be under-
stood as stating the relevance of the
linked page to the linking page. Tech-
niques exploiting this intuition in a
TC context have been presented by
Attardi et al. [1998], Chakrabarti et al. [1998b], Fürnkranz [1999], Gövert et al. [1999], and Oh et al. [2000] and experimentally compared by Yang et al. [2002].

(2) The hierarchical structure of the category set: This may be used, for example, by decomposing the classification problem into a number of smaller classification problems, each corresponding to a branching decision at an internal node. Techniques exploiting this intuition in a TC context have been presented by Dumais and Chen [2000], Chakrabarti et al. [1998a], Koller and Sahami [1997], McCallum et al. [1998], Ruiz and Srinivasan [1999], and Weigend et al. [1999].
if ((wheat & farm) or
(wheat & commodity) or
(bushels & export) or
(wheat & tonnes) or
(wheat & winter & ¬ soft)) then WHEAT else ¬ WHEAT
Fig. 1. Rule-based classifier for the WHEAT category; key words are indicated in italic, categories are indicated in SMALL CAPS (from Apté et al. [1994]).
4. THE MACHINE LEARNING APPROACH
TO TEXT CATEGORIZATION
In the ’80s, the most popular approach
(at least in operational settings) for the
creation of automatic document classifiers
consisted in manually building, by means
of knowledge engineering (KE) techniques,
an expert system capable of taking TC de-
cisions. Such an expert system would typ-
ically consist of a set of manually defined
logical rules, one per category, of type
if !DNF formula) then !category).
A DNF (“disjunctive normal form”) for-
mula is a disjunction of conjunctive
clauses; the document is classified under
!category) iff it satisfies the formula, that
is, iff it satisfies at least one of the clauses.
The most famous example of this approach
is the CONSTRUE system[Hayes et al. 1990],
built by Carnegie Group for the Reuters
news agency. A sample rule of the type
used in CONSTRUE is illustrated in Figure 1.
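For concreteness, the rule of Figure 1 can be transcribed directly into code; a sketch in Python, representing a document by the set of key words occurring in it:

    def wheat_rule(key_words: set) -> bool:
        # Transcription of the Figure 1 rule: a disjunction of conjunctive
        # clauses over the key words occurring in the document.
        return (({"wheat", "farm"} <= key_words) or
                ({"wheat", "commodity"} <= key_words) or
                ({"bushels", "export"} <= key_words) or
                ({"wheat", "tonnes"} <= key_words) or
                ({"wheat", "winter"} <= key_words and "soft" not in key_words))

    # The document is filed under WHEAT iff at least one clause is satisfied.
    print(wheat_rule({"wheat", "tonnes", "harvest"}))  # True
    print(wheat_rule({"wheat", "winter", "soft"}))     # False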
The drawback of this approach is
the knowledge acquisition bottleneck well
known from the expert systems literature.
That is, the rules must be manually de-
fined by a knowledge engineer with the
aid of a domain expert (in this case, an
expert in the membership of documents in
the chosen set of categories): if the set of
categories is updated, then these two pro-
fessionals must intervene again, and if the
classifier is ported to a completely differ-
ent domain (i.e., set of categories), a differ-
ent domain expert needs to intervene and
the work has to be repeated from scratch.
On the other hand, it was originally
suggested that this approach can give very
good effectiveness results: Hayes et al.
[1990] reported a .90 “breakeven” result
(see Section 7) on a subset of the Reuters
test collection, a figure that outperforms
even the best classifiers built in the late
’90s by state-of-the-art ML techniques.
However, no other classifier has been
tested on the same dataset as CONSTRUE,
and it is not clear whether this was a
randomly chosen or a favorable subset of
the entire Reuters collection. As argued
by Yang [1999], the results above do not
allow us to state that these effectiveness
results may be obtained in general.
Since the early ’90s, the ML approach
to TC has gained popularity and has
eventually become the dominant one, at
least in the research community (see
Mitchell [1996] for a comprehensive intro-
duction to ML). In this approach, a general inductive process (also called the learner) automatically builds a classifier for a category c_i by observing the characteristics of a set of documents manually classified under c_i or c̄_i by a domain expert; from these characteristics, the inductive process gleans the characteristics that a new unseen document should have in order to be classified under c_i. In ML terminology, the classification problem is an activity of supervised learning, since the learning process is "supervised" by the knowledge of the categories and of the training instances that belong to them.²

² Within the area of content-based document management tasks, an example of an unsupervised learning activity is document clustering (see Section 1).
The advantages of the ML approach
over the KE approach are evident. The en-
gineering effort goes toward the construc-
tion not of a classifier, but of an automatic
builder of classifiers (the learner). This
means that if a learner is (as it often is)
available off-the-shelf, all that is needed
is the inductive, automatic construction of
a classifier from a set of manually clas-
sified documents. The same happens if a
classifier already exists and the original
set of categories is updated, or if the clas-
sifier is ported to a completely different
domain.
In the ML approach, the preclassified
documents are then the key resource.
In the most favorable case, they are al-
ready available; this typically happens for
organizations which have previously car-
ried out the same categorization activity
manually and decide to automate the pro-
cess. The less favorable case is when no
manually classified documents are avail-
able; this typically happens for organi-
zations that start a categorization activ-
ity and opt for an automated modality
straightaway. The ML approach is more
convenient than the KE approach also in
this latter case. In fact, it is easier to man-
ually classify a set of documents than to
build and tune a set of rules, since it is
easier to characterize a concept extension-
ally (i.e., to select instances of it) than in-
tensionally (i.e., to describe the concept in
words, or to describe a procedure for rec-
ognizing its instances).
Classifiers built by means of ML tech-
niques nowadays achieve impressive lev-
els of effectiveness (see Section 7), making
automatic classification a qualitatively
(and not only economically) viable alter-
native to manual classification.
4.1. Training Set, Test Set, and
Validation Set
The ML approach relies on the availability of an initial corpus Ω = {d_1, ..., d_|Ω|} ⊂ D of documents preclassified under C = {c_1, ..., c_|C|}. That is, the values of the total function Φ̆ : D × C → {T, F} are known for every pair ⟨d_j, c_i⟩ ∈ Ω × C. A document d_j is a positive example of c_i if Φ̆(d_j, c_i) = T, a negative example of c_i if Φ̆(d_j, c_i) = F.
In research settings (and in most opera-
tional settings too), once a classifier has
been built it is desirable to evaluate its ef-
fectiveness. In this case, prior to classifier
construction the initial corpus is split in
two sets, not necessarily of equal size:
—a training(-and-validation) set TV = {d_1, ..., d_|TV|}. The classifier Φ for categories C = {c_1, ..., c_|C|} is inductively built by observing the characteristics of these documents;

—a test set Te = {d_{|TV|+1}, ..., d_|Ω|}, used for testing the effectiveness of the classifiers. Each d_j ∈ Te is fed to the classifier, and the classifier decisions Φ(d_j, c_i) are compared with the expert decisions Φ̆(d_j, c_i). A measure of classification effectiveness is based on how often the Φ(d_j, c_i) values match the Φ̆(d_j, c_i) values.
The documents in Te cannot participate
in any way in the inductive construc-
tion of the classifiers; if this condition
were not satisfied, the experimental re-
sults obtained would likely be unrealis-
tically good, and the evaluation would
thus have no scientific character [Mitchell
1996, page 129]. In an operational setting,
after evaluation has been performed one
would typically retrain the classifier on
the entire initial corpus, in order to boost
effectiveness. In this case, the results of
the previous evaluation would be a pes-
simistic estimate of the real performance,
since the final classifier has been trained
on more data than the classifier evaluated.
This is called the train-and-test approach. An alternative is the k-fold cross-validation approach (see Mitchell [1996], page 146), in which k different classifiers Φ_1, ..., Φ_k are built by partitioning the initial corpus into k disjoint sets Te_1, ..., Te_k and then iteratively applying the train-and-test approach on pairs ⟨TV_i = Ω − Te_i, Te_i⟩. The final effectiveness figure is obtained by individually computing the effectiveness of Φ_1, ..., Φ_k, and then averaging the individual results in some way.
In both approaches, it is often the case
that the internal parameters of the clas-
sifiers must be tuned by testing which
values of the parameters yield the best
effectiveness. In order to make this op-
timization possible, in the train-and-test
approach the set {d_1, ..., d_|TV|} is further split into a training set Tr = {d_1, ..., d_|Tr|}, from which the classifier is built, and a validation set Va = {d_{|Tr|+1}, ..., d_|TV|} (sometimes called a hold-out set), on which the repeated tests of the classifier aimed
at parameter optimization are performed;
the obvious variant may be used in the
k-fold cross-validation case. Note that, for
the same reason why we do not test a clas-
sifier on the documents it has been trained
on, we do not test it on the documents it
has been optimized on: test set and vali-
dation set must be kept separate.³

³ From now on, we will use the expression "test document" to denote any document not in the training set or the validation set. This thus includes any document submitted to the classifier in the operational phase.
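Both protocols can be sketched in a few lines of Python; learn and effectiveness are assumed inputs, the 70/10/20 split proportions are arbitrary, and the k-fold averaging is a plain mean:

    def train_and_test(learn, effectiveness, corpus):
        # Split the initial corpus into Tr, Va, and Te; build the classifier
        # on Tr, tune its parameters on Va, and evaluate it once on Te.
        n = len(corpus)
        Tr = corpus[: int(0.7 * n)]
        Va = corpus[int(0.7 * n): int(0.8 * n)]
        Te = corpus[int(0.8 * n):]
        classifier = learn(Tr, Va)  # Va drives parameter optimization only
        return effectiveness(classifier, Te)

    def k_fold_cross_validation(learn, effectiveness, corpus, k=10):
        # Partition the corpus into k disjoint test sets Te_1, ..., Te_k,
        # train on each complement TV_i, and average the k effectiveness
        # figures with a plain mean (validation handling omitted here).
        folds = [corpus[i::k] for i in range(k)]
        results = []
        for i, Te_i in enumerate(folds):
            TV_i = [d for j, fold in enumerate(folds) if j != i for d in fold]
            results.append(effectiveness(learn(TV_i, None), Te_i))
        return sum(results) / k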
Given a corpus Ω, one may define the generality g_Ω(c_i) of a category c_i as the percentage of documents that belong to c_i, that is:

    g_Ω(c_i) = |{d_j ∈ Ω | Φ̆(d_j, c_i) = T}| / |Ω|.

The training set generality g_Tr(c_i), validation set generality g_Va(c_i), and test set generality g_Te(c_i) of c_i may be defined in the obvious way.
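Under a representation of Ω as a list of (document, categories) pairs (an assumption of the example), generality is immediate to compute; a sketch:

    def generality(corpus, c_i):
        # g(c_i): the fraction of documents in the corpus that are positive
        # examples of c_i; `corpus` is a list of (document, categories) pairs.
        return sum(1 for _, cats in corpus if c_i in cats) / len(corpus)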
4.2. Information Retrieval Techniques
and Text Categorization
Text categorization heavily relies on the
basic machinery of IR. The reason is that
TC is a content-based document manage-
ment task, and as such it shares many
characteristics with other IR tasks such
as text search.
IR techniques are used in three phases
of the text classifier life cycle:
(1) IR-style indexing is always performed
on the documents of the initial corpus
and on those to be classified during the
operational phase;
(2) IR-style techniques (such as docu-
ment-request matching, query refor-
mulation, . . .) are often used in the in-
ductive construction of the classifiers;
(3) IR-style evaluation of the effectiveness
of the classifiers is performed.
The various approaches to classification differ mostly in how they tackle (2), although in a few cases nonstandard approaches to (1) and (3) are also used. Indexing, induction, and evaluation are the themes of Sections 5, 6, and 7, respectively.
5. DOCUMENT INDEXING AND
DIMENSIONALITY REDUCTION
5.1. Document Indexing
Texts cannot be directly interpreted by a classifier or by a classifier-building algorithm. Because of this, an indexing procedure that maps a text d_j into a compact representation of its content needs to be uniformly applied to training, validation, and test documents. The choice of a representation for text depends on what one regards as the meaningful units of text (the problem of lexical semantics) and the meaningful natural language rules for the combination of these units (the problem of compositional semantics). Similarly to what happens in IR, in TC this latter problem is usually disregarded,⁴ and a text d_j is usually represented as a vector of term weights d⃗_j = ⟨w_1j, ..., w_|T|j⟩, where T is the set of terms (sometimes called features) that occur at least once in at least one document of Tr, and 0 ≤ w_kj ≤ 1 represents, loosely speaking, how much term t_k contributes to the semantics of document d_j. Differences among approaches
are accounted for by
(1) different ways to understand what a
term is;
(2) different ways to compute term
weights.
A typical choice for (1) is to identify terms
with words. This is often called either the
set of words or the bag of words approach
to document representation, depending on
whether weights are binary or not.
In a number of experiments [Apté et al. 1994; Dumais et al. 1998; Lewis 1992a], it has been found that representations more sophisticated than this do not yield significantly better effectiveness, thereby confirming similar results from IR [Salton and Buckley 1988]. In particular, some authors have used phrases, rather than individual words, as indexing terms [Fuhr et al. 1991; Schütze et al. 1995; Tzeras and Hartmann 1993], but the experimental results found to date have not been uniformly encouraging, irrespective of whether the notion of "phrase" is motivated

—syntactically, that is, the phrase is such according to a grammar of the language (see Lewis [1992a]); or

—statistically, that is, the phrase is not grammatically such, but is composed of a set/sequence of words whose patterns of contiguous occurrence in the collection are statistically significant (see Caropreso et al. [2001]).

⁴ An exception to this is represented by learning approaches based on hidden Markov models [Denoyer et al. 2001; Frasconi et al. 2002].
Lewis [1992a] argued that the likely rea-
son for the discouraging results is that,
although indexing languages based on
phrases have superior semantic qualities,
they have inferior statistical qualities
with respect to word-only indexing lan-
guages: a phrase-only indexing language
has “more terms, more synonymous or
nearly synonymous terms, lower consis-
tency of assignment (since synonymous
terms are not assigned to the same docu-
ments), and lower document frequency for
terms” [Lewis 1992a, page 40]. Although
his remarks are about syntactically moti-
vated phrases, they also apply to statisti-
cally motivated ones, although perhaps to
a smaller degree. A combination of the two approaches is probably the best way to go: Tzeras and Hartmann [1993] obtained significant improvements by using noun phrases obtained through a combination of syntactic and statistical criteria, where a "crude" syntactic method was complemented by a statistical filter (only those syntactic phrases that occurred at least three times in the positive examples of a category c_i were retained). It is likely that the final word on the usefulness of phrase indexing in TC has still to be told, and investigations in this direction are still being actively pursued [Caropreso et al. 2001; Mladenić and Grobelnik 1998].
As for issue (2), weights usually
range between 0 and 1 (an exception is
Lewis et al. [1996]), and for ease of expo-
sition we will assume they always do. As a
special case, binary weights may be used
(1 denoting presence and 0 absence of the
term in the document); whether binary or
nonbinary weights are used depends on
the classifier learning algorithm used. In
the case of nonbinary indexing, for determining the weight w_kj of term t_k in document d_j any IR-style indexing technique that represents a document as a vector of weighted terms may be used. Most of the time, the standard tfidf function is used (see Salton and Buckley [1988]), defined as

    tfidf(t_k, d_j) = #(t_k, d_j) · log(|Tr| / #_Tr(t_k)),   (1)

where #(t_k, d_j) denotes the number of times t_k occurs in d_j, and #_Tr(t_k) denotes the document frequency of term t_k, that is, the number of documents in Tr in which t_k occurs. This function embodies the intuitions that (i) the more often a term occurs in a document, the more it is representative of its content, and (ii) the more documents a term occurs in, the less discriminating it is.⁵
Note that
this formula (as most other indexing
formulae) weights the importance of a
term to a document in terms of occurrence
considerations only, thereby deeming of
null importance the order in which the
terms occur in the document and the syn-
tactic role they play. In other words, the
semantics of a document is reduced to the
collective lexical semantics of the terms
that occur in it, thereby disregarding the
issue of compositional semantics (exceptions are the representation techniques used for FOIL [Cohen 1995a] and SLEEPING EXPERTS [Cohen and Singer 1999]).
In order for the weights to fall in the [0,1] interval and for the documents to be represented by vectors of equal length, the weights resulting from tfidf are often normalized by cosine normalization, given by

    w_kj = tfidf(t_k, d_j) / √( Σ_{s=1}^{|T|} (tfidf(t_s, d_j))² ).   (2)

⁵ There exist many variants of tfidf, which differ from each other in terms of logarithms, normalization, or other correction factors. Formula (1) is just one of the possible instances of this class; see Salton and Buckley [1988] and Singhal et al. [1996] for variations on this theme.
Although normalized tfidf is the most popular indexing function, others have also been used, including probabilistic techniques [Gövert et al. 1999] and techniques for indexing structured documents [Larkey and Croft 1996]. Functions different from tfidf are especially needed when Tr is not available in its entirety from the start and #_Tr(t_k) cannot thus be computed, as in adaptive filtering; in this case, approximations of tfidf are usually employed [Dagan et al. 1997, Section 4.3].
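A minimal sketch of Formulas (1) and (2) over a tokenized training corpus; the whitespace tokenization and the toy documents are assumptions made only to keep the example self-contained:

    import math
    from collections import Counter

    def tfidf_vectors(train_docs):
        # Compute cosine-normalized tfidf weights (Formulas (1) and (2))
        # for each document of Tr, given as a list of token lists.
        n = len(train_docs)
        # Document frequency: the number of documents in which t_k occurs.
        df = Counter(t for doc in train_docs for t in set(doc))
        vectors = []
        for doc in train_docs:
            tf = Counter(doc)  # occurrence counts #(t_k, d_j) within d_j
            w = {t: tf[t] * math.log(n / df[t]) for t in tf}   # Formula (1)
            norm = math.sqrt(sum(v * v for v in w.values()))   # Formula (2)
            vectors.append({t: v / norm for t, v in w.items()} if norm else w)
        return vectors

    docs = ["wheat farm exports rise".split(),
            "jazz funeral held in new orleans".split()]
    print(tfidf_vectors(docs)[0])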
Before indexing, the removal of function
words (i.e., topic-neutral words such as ar-
ticles, prepositions, conjunctions, etc.) is
almost always performed (exceptions include Lewis et al. [1996], Nigam et al. [2000], and Riloff [1995]).⁶ Concerning stemming (i.e., grouping words that share the same morphological root), its suitability to TC is controversial. Although, similarly to unsupervised term clustering (see Section 5.5.1) of which it is an instance, stemming has sometimes been reported to hurt effectiveness (e.g., Baker and McCallum [1998]), the recent tendency is to adopt it, as it reduces both the dimensionality of the term space (see Section 5.3) and the stochastic dependence between terms (see Section 6.2).
Depending on the application, either the full text of the document or selected parts of it are indexed. While the former option is the rule, exceptions exist. For instance, in a patent categorization application Larkey [1999] indexed only the title, the abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention. This approach was made possible by the fact that documents describing patents are structured. Similarly, when a document title is available, one can pay extra importance to the words it contains [Apté et al. 1994; Cohen and Singer 1999; Weiss et al. 1999]. When documents are flat, identifying the most relevant part of a document is instead a nonobvious task.

⁶ One application of TC in which it would be inappropriate to remove function words is author identification for documents of disputed paternity. In fact, as noted in Manning and Schütze [1999], page 589, "it is often the 'little' words that give an author away (for example, the relative frequencies of words like because or though)."
5.2. The Darmstadt Indexing Approach
The AIR/X system [Fuhr et al. 1991] occupies a special place in the literature on indexing for TC. This system is the final result of the AIR project, one of the most important efforts in the history of TC: spanning a duration of more than 10 years [Knorz 1982; Tzeras and Hartmann 1993], it has produced a system operatively employed since 1985 in the classification of corpora of scientific literature of O(10⁵) documents and O(10⁴) categories, and has had important theoretical spin-offs in the field of probabilistic indexing [Fuhr 1989; Fuhr and Buckley 1991].⁷

⁷ The AIR/X system, its applications (including the AIR/PHYS system [Biebricher et al. 1988], an application of AIR/X to indexing physics literature), and its experiments have also been richly documented in a series of papers and doctoral theses written in German. The interested reader may consult Fuhr et al. [1991] for a detailed bibliography.
The approach to indexing taken in
AIR/X is known as the Darmstadt In-
dexing Approach (DIA) [Fuhr 1985].
Here, “indexing” is used in the sense of
Section 3.1, that is, as using terms from
a controlled vocabulary, and is thus a
synonym of TC (the DIA was later ex-
tended to indexing with free terms [Fuhr
and Buckley 1991]). The idea that under-
lies the DIA is the use of a much wider
set of “features” than described in Sec-
tion 5.1. All other approaches mentioned
in this paper view terms as the dimen-
sions of the learning space, where terms
may be single words, stems, phrases, or
(see Sections 5.5.1 and 5.5.2) combina-
tions of any of these. In contrast, the DIA
considers properties (of terms, documents,
categories, or pairwise relationships among these) as basic dimensions of the learning space. Examples of these are

—properties of a term t_k: for example, the idf of t_k;

—properties of the relationship between a term t_k and a document d_j: for example, the tf of t_k in d_j; or the location (e.g., in the title, or in the abstract) of t_k within d_j;

—properties of a document d_j: for example, the length of d_j;

—properties of a category c_i: for example, the training set generality of c_i.

For each possible document-category pair, the values of these features are collected in a so-called relevance description vector rd(d_j, c_i). The size of this vector is determined by the number of properties considered, and is thus independent of specific terms, categories, or documents (for multivalued features, appropriate aggregation functions are applied in order to yield a single value to be included in rd(d_j, c_i)); in this way an abstraction from specific terms, categories, or documents is achieved.
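A sketch of how such a vector might be assembled, with illustrative properties taken from the list above (these, and the use of averaging as the aggregation function, are assumptions of the example, not the actual AIR/X feature set):

    import math

    def rd(d_j, c_i, train_docs, g_tr):
        # Build a relevance description vector for the pair (d_j, c_i).
        # `d_j` is a token list, `train_docs` the training corpus as a list
        # of token lists, and `g_tr` maps categories to their training set
        # generality.
        n = len(train_docs)
        per_term = []
        for t_k in set(d_j):
            df = sum(1 for doc in train_docs if t_k in doc)
            idf = math.log(n / df) if df else 0.0  # property of the term t_k
            tf = d_j.count(t_k)                    # term-document property
            per_term.append((idf, tf))
        # Multivalued (per-term) features are aggregated into single values,
        # here simply by averaging, so that the vector size is fixed.
        avg_idf = sum(f[0] for f in per_term) / len(per_term)
        avg_tf = sum(f[1] for f in per_term) / len(per_term)
        return [avg_idf, avg_tf,
                len(d_j),    # property of the document: its length
                g_tr[c_i]]   # property of the category: its generality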
The main advantage of this approach is the possibility to consider additional features that can hardly be accounted for in the usual term-based approaches, for example, the location of a term within a document, or the certainty with which a phrase was identified in a document. The term-category relationship is described by estimates, derived from the training set, of the probability P(c_i | t_k) that a document belongs to category c_i, given that it contains term t_k (the DIA association factor).⁸ Relevance description vectors rd(d_j, c_i) are then the final representations that are used for the classification of document d_j under category c_i.
The essential ideas of the DIA—transforming the classification space by means of abstraction and using a more detailed text representation than the standard bag-of-words approach—have not been taken up by other researchers so far. For new TC applications dealing with structured documents or categorization of Web pages, these ideas may become of increasing importance.

⁸ Association factors are called adhesion coefficients in many early papers on TC; see Field [1975]; Robertson and Harding [1984].
5.3. Dimensionality Reduction
Unlike in text retrieval, in TC the high dimensionality of the term space (i.e., the large value of |T|) may be problematic. In fact, while typical algorithms used in text retrieval (such as cosine matching) can scale to high values of |T|, the same does not hold of many sophisticated learning algorithms used for classifier induction (e.g., the LLSF algorithm of Yang and Chute [1994]). Because of this, before classifier induction one often applies a pass of dimensionality reduction (DR), whose effect is to reduce the size of the vector space from |T| to |T′| ≪ |T|; the set T′ is called the reduced term set.
DR is also beneficial since it tends to re-
duce overfitting, that is, the phenomenon
by which a classifier is tuned also to
the contingent characteristics of the train-
ing data rather than just the constitu-
tive characteristics of the categories. Clas-
sifiers that overfit the training data are
good at reclassifying the data they have
been trained on, but much worse at clas-
sifying previously unseen data. Experi-
ments have shown that, in order to avoid
overfitting a number of training exam-
ples roughly proportional to the number
of terms used is needed; Fuhr and Buckley [1991, page 235] have suggested that 50–100 training examples per term may be needed in TC tasks. This means that, if DR is performed, overfitting may be avoided even if a smaller number of training examples is used.
the risk is to remove potentially useful
information on the meaning of the docu-
ments. It is then clear that, in order to
obtain optimal (cost-)effectiveness, the re-
duction process must be performed with
care. Various DR methods have been pro-
posed, either from the information theory
or from the linear algebra literature, and
their relative merits have been tested by
experimentally evaluating the variation
in effectiveness that a given classifier
undergoes after application of the function
to the term space.
There are two distinct ways of view-
ing DR, depending on whether the task is
performed locally (i.e., for each individual
category) or globally:
—local DR: for each category c_i, a set T′_i of terms, with |T′_i| ≪ |T|, is chosen for classification under c_i (see Apté et al. [1994]; Lewis and Ringuette [1994]; Li and Jain [1998]; Ng et al. [1997]; Sable and Hatzivassiloglou [2000]; Schütze et al. [1995]; Wiener et al. [1995]). This means that different subsets of d⃗_j are used when working with the different categories. Typical values are 10 ≤ |T′_i| ≤ 50.

—global DR: a set T′ of terms, with |T′| ≪ |T|, is chosen for the classification under all categories C = {c_1, ..., c_|C|} (see Caropreso et al. [2001]; Mladenić [1998]; Yang [1999]; Yang and Pedersen [1997]).
This distinction usually does not impact
on the choice of DR technique, since
most such techniques can be used (and
have been used) for local and global
DR alike (supervised DR techniques—see
Section 5.5.1—are exceptions to this rule).
In the rest of this section, we will assume
that the global approach is used, although
everything we will say also applies to the
local approach.
A second, orthogonal distinction may be
drawn in terms of the nature of the result-
ing terms:
—DR by term selection: T′ is a subset of T;

—DR by term extraction: the terms in T′ are not of the same type as the terms in T (e.g., if the terms in T are words, the terms in T′ may not be words at all), but are obtained by combinations or transformations of the original ones.
Unlike in the previous distinction, these
two ways of doing DR are tackled by very
different techniques; we will address them
separately in the next two sections.
5.4. Dimensionality Reduction
by Term Selection
Given a predetermined integer r, techniques for term selection (also called term space reduction—TSR) attempt to select, from the original set T, the set T′ of terms (with |T′| ≪ |T|) that, when used for document indexing, yields the highest effectiveness. Yang and Pedersen [1997] have shown that TSR may even result in a moderate (≤5%) increase in effectiveness, depending on the classifier, on the aggressivity |T|/|T′| of the reduction, and on the TSR technique used.
Moulinier et al. [1996] have used a so-called wrapper approach, that is, one in which T′ is identified by means of the same learning method that will be used for building the classifier [John et al. 1994]. Starting from an initial term set, a new term set is generated by either adding or removing a term. When a new term set is generated, a classifier based on it is built and then tested on a validation set. The term set that results in the best effectiveness is chosen. This approach has the advantage of being tuned to the learning algorithm being used; moreover, if local DR is performed, different numbers of terms for different categories may be chosen, depending on whether a category is or is not easily separable from the others. However, the sheer size of the space of different term sets makes its cost prohibitive for standard TC applications.
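The wrapper loop just described can be sketched as greedy forward selection; build_and_test, which trains a classifier on a candidate term set and returns its validation-set effectiveness, is an assumed input, and the search strategy is one of several possible variants:

    def wrapper_selection(terms, build_and_test, max_terms):
        # Greedy forward selection: grow the term set one term at a time,
        # keeping the addition that most improves validation effectiveness.
        selected, best = set(), 0.0
        while len(selected) < max_terms:
            scored = [(build_and_test(selected | {t}), t)
                      for t in terms if t not in selected]
            if not scored:
                break
            score, t = max(scored)
            if score <= best:  # no remaining term improves effectiveness
                break
            selected.add(t)
            best = score
        return selected

Each round trains one classifier per candidate term, which is exactly the cost that makes the approach prohibitive for large term sets.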
A computationally easier alternative is the filtering approach [John et al. 1994], that is, keeping the |T′| ≪ |T| terms that receive the highest score according to a function that measures the "importance" of the term for the TC task. We will explore this solution in the rest of this section.
5.4.1. Document Frequency. A simple and effective global TSR function is the document frequency #_Tr(t_k) of a term t_k, that is, only the terms that occur in the highest number of documents are retained. In a series of experiments Yang and Pedersen [1997] have shown that with #_Tr(t_k) it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness (a reduction by a factor of 100 bringing about just a small loss).
This seems to indicate that the terms
occurring most frequently in the collection
are the most valuable for TC. As such, this
would seem to contradict a well-known
“law” of IR, according to which the terms
with low-to-medium document frequency
are the most informative ones [Salton and
Buckley 1988]. But these two results do
not contradict each other, since it is well
known (see Salton et al. [1975]) that the
large majority of the words occurring in
a corpus have a very low document fre-
quency; this means that by reducing the
term set by a factor of 10 using document
frequency, only such words are removed,
while the words from low-to-medium to
high document frequency are preserved.
Of course, stop words need to be removed
in advance, lest only topic-neutral words
are retained [Mladenić 1998].
Finally, note that a slightly more empir-
ical form of TSR by document frequency
is adopted by many authors, who remove
all terms occurring in at most x train-
ing documents (popular values for x range
from 1 to 3), either as the only form of DR
[Maron 1961; Ittner et al. 1995] or before
applying another more sophisticated form
[Dumais et al. 1998; Li and Jain 1998]. A
variant of this policy is removing all terms
that occur at most x times in the train-
ing set (e.g., Dagan et al. [1997]; Joachims
[1997]), with popular values for x rang-
ing from 1 (e.g., Baker and McCallum
[1998]) to 5 (e.g., Apté et al. [1994]; Cohen
[1995a]).
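By way of illustration, the following sketch (ours, not from the works cited) implements this family of policies over a toy corpus of tokenized training documents: terms occurring in at most x training documents are removed, which for a suitable x amounts to retaining only the terms with the highest document frequency.

```python
from collections import Counter

def df_reduce(training_docs, x=1):
    """Keep only the terms whose document frequency exceeds x.

    training_docs: list of documents, each a list of tokens.
    Returns the reduced term set T'.
    """
    # Document frequency: number of training documents a term occurs in.
    df = Counter()
    for doc in training_docs:
        df.update(set(doc))  # count each term once per document
    return {t for t, n in df.items() if n > x}

# Toy usage: terms occurring in at most 1 training document are dropped.
docs = [["wheat", "crop", "export"],
        ["wheat", "price", "export"],
        ["election", "wheat"]]
print(sorted(df_reduce(docs, x=1)))   # ['export', 'wheat']
```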
5.4.2. Other Information-Theoretic Term
Selection Functions. Other more sophis-
ticated information-theoretic functions
have been used in the literature, among
them the DIA association factor [Fuhr
et al. 1991], chi-square [Caropreso et al.
2001; Galavotti et al. 2000; Schütze et al.
1995; Sebastiani et al. 2000; Yang and
Pedersen 1997; Yang and Liu 1999],
NGL coefficient [Ng et al. 1997; Ruiz
and Srinivasan 1999], information gain
[Caropreso et al. 2001; Larkey 1998;
Lewis 1992a; Lewis and Ringuette 1994;
Mladenić 1998; Moulinier and Ganascia
1996; Yang and Pedersen 1997; Yang and
Liu 1999], mutual information [Dumais
et al. 1998; Lam et al. 1997; Larkey
and Croft 1996; Lewis and Ringuette
1994; Li and Jain 1998; Moulinier et al.
1996; Ruiz and Srinivasan 1999; Taira
and Haruno 1999; Yang and Pedersen
1997], odds ratio [Caropreso et al. 2001;
Mladenić 1998; Ruiz and Srinivasan
1999], relevancy score [Wiener et al.
1995], and GSS coefficient [Galavotti
et al. 2000]. The mathematical definitions of these measures are summarized for convenience in Table I.⁹ Here, probabilities are interpreted on an event space of documents (e.g., $P(\bar{t}_k, c_i)$ denotes the probability that, for a random document $x$, term $t_k$ does not occur in $x$ and $x$ belongs to category $c_i$), and are estimated by counting occurrences in the training set. All functions are specified “locally” to a specific category $c_i$; in order to assess the value of a term $t_k$ in a “global,” category-independent sense, either the sum $f_{\mathrm{sum}}(t_k) = \sum_{i=1}^{|C|} f(t_k, c_i)$, or the weighted sum $f_{\mathrm{wsum}}(t_k) = \sum_{i=1}^{|C|} P(c_i)\, f(t_k, c_i)$, or the maximum $f_{\mathrm{max}}(t_k) = \max_{i=1}^{|C|} f(t_k, c_i)$ of their category-specific values $f(t_k, c_i)$ are usually computed.
These functions try to capture the intuition that the best terms for $c_i$ are the ones distributed most differently in the sets of positive and negative examples of $c_i$. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences $\chi^2$ is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent $t_k$ and $c_i$ are.
⁹ For better uniformity Table I views all the TSR functions of this section in terms of subjective probability. In some cases, such as $\chi^2(t_k, c_i)$, this is slightly artificial, since this function is not usually viewed in probabilistic terms. The formulae refer to the “local” (i.e., category-specific) forms of the functions, which again is slightly artificial in some cases. Note that the NGL and GSS coefficients are here named after their authors, since they had originally been given names that might generate some confusion if used here.
Table I. Main Functions Used for Term Space Reduction Purposes. Information Gain Is Also Known as Expected Mutual Information, and Is Used Under This Name by Lewis [1992a, page 44] and Larkey [1998]. In the $RS(t_k, c_i)$ Formula, $d$ Is a Constant Damping Factor.

Function                 Denoted by            Mathematical form
DIA association factor   $z(t_k, c_i)$         $P(c_i \mid t_k)$
Information gain         $IG(t_k, c_i)$        $\sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c) \cdot \log \frac{P(t, c)}{P(t) \cdot P(c)}$
Mutual information       $MI(t_k, c_i)$        $\log \frac{P(t_k, c_i)}{P(t_k) \cdot P(c_i)}$
Chi-square               $\chi^2(t_k, c_i)$    $\frac{|Tr| \cdot [P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)]^2}{P(t_k) \cdot P(\bar{t}_k) \cdot P(c_i) \cdot P(\bar{c}_i)}$
NGL coefficient          $NGL(t_k, c_i)$       $\frac{\sqrt{|Tr|} \cdot [P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)]}{\sqrt{P(t_k) \cdot P(\bar{t}_k) \cdot P(c_i) \cdot P(\bar{c}_i)}}$
Relevancy score          $RS(t_k, c_i)$        $\log \frac{P(t_k \mid c_i) + d}{P(\bar{t}_k \mid \bar{c}_i) + d}$
Odds ratio               $OR(t_k, c_i)$        $\frac{P(t_k \mid c_i) \cdot (1 - P(t_k \mid \bar{c}_i))}{(1 - P(t_k \mid c_i)) \cdot P(t_k \mid \bar{c}_i)}$
GSS coefficient          $GSS(t_k, c_i)$       $P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)$
The terms $t_k$ with the lowest value for $\chi^2(t_k, c_i)$ are thus the most independent from $c_i$; since we are interested in the terms which are not, we select the terms for which $\chi^2(t_k, c_i)$ is highest.
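By way of illustration, the sketch below (ours, not from the works cited) computes the local $\chi^2(t_k, c_i)$ function of Table I from raw training-set counts; a global score such as $\chi^2_{\mathrm{max}}$ would then be obtained by maximizing over the categories, and the $r$ highest-scoring terms retained.

```python
def chi_square(N, n_tc, n_t, n_c):
    """Local chi-square TSR function, estimated from training-set counts.

    N:    total number of training documents |Tr|
    n_tc: documents of category c containing term t
    n_t:  documents containing term t
    n_c:  documents of category c
    """
    p_tc = n_tc / N                      # P(t, c)
    p_tnc = (n_t - n_tc) / N             # P(t, not-c)
    p_ntc = (n_c - n_tc) / N             # P(not-t, c)
    p_ntnc = (N - n_t - n_c + n_tc) / N  # P(not-t, not-c)
    p_t, p_c = n_t / N, n_c / N
    num = N * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2
    den = p_t * (1 - p_t) * p_c * (1 - p_c)
    return num / den if den > 0 else 0.0

# A term occurring in 40 of 50 positive documents and 5 of 950 negative ones
# is highly dependent on the category, hence a high chi-square value:
print(chi_square(N=1000, n_tc=40, n_t=45, n_c=50))   # ~698
```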
While each TSR function has its own
rationale, the ultimate word on its value
is the effectiveness it brings about. Var-
ious experimental comparisons of TSR
functions have thus been carried out
[Caropreso et al. 2001; Galavotti et al. 2000; Mladenić 1998; Yang and Pedersen 1997]. In these experiments most functions listed in Table I (with the possible exception of MI) have improved on the results of document frequency. For instance, Yang and Pedersen [1997] have shown that, with various classifiers and various initial corpora, sophisticated techniques such as $IG_{\mathrm{sum}}(t_k, c_i)$ or $\chi^2_{\mathrm{max}}(t_k, c_i)$ can reduce the dimensionality of the term space by a factor of 100 with no loss (or even with a small increase) of effectiveness. Collectively, the experiments reported in the above-mentioned papers seem to indicate that

$\{OR_{\mathrm{sum}}, NGL_{\mathrm{sum}}, GSS_{\mathrm{max}}\} > \{\chi^2_{\mathrm{max}}, IG_{\mathrm{sum}}\} > \{\chi^2_{\mathrm{wavg}}\} \gg \{MI_{\mathrm{max}}, MI_{\mathrm{wsum}}\},$

where “>” means “performs better than.” However, it should be noted that these results are just indicative, and that more general statements on the relative merits of these functions could be made only as a result of comparative experiments performed in thoroughly controlled conditions and on a variety of different situations (e.g., different classifiers, different initial corpora, ...).
5.5. Dimensionality Reduction
by Term Extraction
Given a predetermined $|T'| < |T|$, term extraction attempts to generate, from the original set $T$, a set $T'$ of “synthetic” terms that maximize effectiveness. The rationale for using synthetic (rather than naturally occurring) terms is that, due to the pervasive problems of polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation. Methods for term extraction try to solve these problems by creating artificial terms that do not suffer from them. Any term extraction method consists in (i) a method
for extracting the new terms from the
old ones, and (ii) a method for convert-
ing the original document representa-
tions into new representations based on
the newly synthesized dimensions. Two
term extraction methods have been exper-
imented with in TC, namely term cluster-
ing and latent semantic indexing.
5.5.1. Term Clustering. Term clustering
tries to group words with a high degree of
pairwise semantic relatedness, so that the
groups (or their centroids, or a represen-
tative of them) may be used instead of the
terms as dimensions of the vector space.
Term clustering is different from term se-
lection, since the former tends to address
terms synonymous (or near-synonymous)
with other terms, while the latter targets
noninformative terms.¹⁰

¹⁰ Some term selection methods, such as wrapper methods, also address the problem of redundancy.
Lewis [1992a] was the first to inves-
tigate the use of term clustering in TC.
The method he employed, called reciprocal nearest neighbor clustering, consists in creating clusters of two terms, each of which is the other's nearest neighbor according to some measure of similarity. His results were inferior to those obtained by
single-word indexing, possibly due to a dis-
appointing performance by the clustering
method: as Lewis [1992a, page 48] said,
“The relationships captured in the clus-
ters are mostly accidental, rather than the
systematic relationships that were hoped
for.”
Li and Jain [1998] viewed semantic
relatedness between words in terms of
their co-occurrence and co-absence within
training documents. By using this tech-
nique in the context of a hierarchical
clustering algorithm, they witnessed only
a marginal effectiveness improvement;
however, the small size of their experiment
(see Section 6.11) hardly allows any defini-
tive conclusion to be reached.
Both Lewis [1992a] and Li and Jain
[1998] are examples of unsupervised clus-
tering, since clustering is not affected by
the category labels attached to the docu-
ments. Baker and McCallum [1998] pro-
vided instead an example of supervised
clustering, as the distributional clustering
method they employed clusters together
those terms that tend to indicate the pres-
ence of the same category, or group of cat-
egories. Their experiments, carried out in
the context of a Naïve Bayes classifier (see
Section 6.2), showed only a 2% effective-
ness loss with an aggressivity of 1,000,
and even showed some effectiveness im-
provement with less aggressive levels of
reduction. Later experiments by Slonim
and Tishby [2001] have confirmed the po-
tential of supervised clustering methods
for term extraction.
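A minimal sketch of the supervised flavor of term clustering (ours; Baker and McCallum's actual method is distributional clustering, for which the off-the-shelf k-means routine used here is only a stand-in): each term is represented by its estimated distribution over categories, and terms with similar distributions are merged into a single synthetic dimension.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_terms(term_cat_counts, n_clusters):
    """Group terms whose category-conditional distributions are similar.

    term_cat_counts: (|T| x |C|) array; entry (k, i) counts the training
    documents of category c_i in which term t_k occurs (assumed nonzero rows).
    Returns an array mapping each term index to a cluster (synthetic term).
    """
    counts = np.asarray(term_cat_counts, dtype=float)
    # P(c_i | t_k), estimated by counting occurrences in the training set.
    p_c_given_t = counts / counts.sum(axis=1, keepdims=True)
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(p_c_given_t)

# Terms 0-1 signal category 0, terms 2-3 category 1: two clusters emerge.
counts = [[30, 2], [25, 1], [1, 40], [3, 35]]
print(cluster_terms(counts, n_clusters=2))   # e.g., [0 0 1 1] (labels may swap)
```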
5.5.2. Latent Semantic Indexing. Latent se-
mantic indexing (LSI—[Deerwester et al.
1990]) is a DR technique developed in IR
in order to address the problems deriv-
ing from the use of synonymous, near-
synonymous, and polysemous words as
dimensions of document representations.
This technique compresses document vec-
tors into vectors of a lower-dimensional
space whose dimensions are obtained
as combinations of the original dimen-
sions by looking at their patterns of co-
occurrence. In practice, LSI infers the
dependence among the original terms
from a corpus and “wires” this dependence
into the newly obtained, independent di-
mensions. The function mapping original
vectors into new vectors is obtained by ap-
plying a singular value decomposition to
the matrix formed by the original docu-
ment vectors. In TC this technique is ap-
plied by deriving the mapping function
from the training set and then applying
it to training and test documents alike.
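A compact sketch of this process (ours; a real system would use sparse matrices and appropriate term weighting): the mapping is derived from the training document matrix by a truncated singular value decomposition, and then applied to training and test vectors alike.

```python
import numpy as np

def lsi_fit(D_train, r):
    """Derive an r-dimensional LSI mapping from the |T| x |Tr| matrix
    whose columns are the training document vectors."""
    U, s, Vt = np.linalg.svd(D_train, full_matrices=False)
    return U[:, :r]              # top r left singular vectors

def lsi_transform(U_r, d):
    """Map an original |T|-dimensional vector into the LSI space."""
    return U_r.T @ d

rng = np.random.default_rng(0)
D = rng.random((100, 20))        # 100 terms, 20 training documents
U_r = lsi_fit(D, r=5)
print(lsi_transform(U_r, D[:, 0]).shape)   # (5,)
```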
One characteristic of LSI is that the newly obtained dimensions are not, unlike in term selection and term clustering, intuitively interpretable. However, they work well in bringing out the “latent” semantic structure of the vocabulary used in the corpus. For instance, Schütze et al. [1995, page 235] discussed the classification under category Demographic shifts in the U.S. with economic impact of a document that was indeed a positive
test instance for the category, and that contained, among others, the quite revealing sentence The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West. The classifier decision was incorrect when local DR had been performed by $\chi^2$-based term selection retaining the top original 200 terms, but was correct when the same task was tackled by means of LSI. This well exemplifies how LSI works: the above sentence does not contain any of the 200 terms most relevant to the category selected by $\chi^2$, but quite possibly the words contained in it have concurred to produce one or more of the LSI higher-order terms that generate the document space of the category. As Schütze et al. [1995, page 230] put it, “if there is a great number of terms which all contribute a small amount of critical information, then the combination of evidence is a major problem for a term-based classifier.” A drawback of LSI, though, is that if some original term is particularly good in itself at discriminating a category, that discrimination power may be lost in the new vector space.
Wiener et al. [1995] used LSI in two
alternative ways: (i) for local DR, thus
creating several category-specific LSI
representations, and (ii) for global DR,
thus creating a single LSI representa-
tion for the entire category set. Their
experiments showed the former approach
to perform better than the latter, and
both approaches to perform better than
simple TSR based on Relevancy Score
(see Table I).
Schütze et al. [1995] experimentally compared LSI-based term extraction with $\chi^2$-based TSR using three different classifier learning techniques (namely, linear discriminant analysis, logistic regression, and neural networks). Their experiments showed LSI to be far more effective than $\chi^2$ for the first two techniques, while both methods performed equally well for the neural network classifier.
For other TC works that have used LSI or similar term extraction techniques, see Hull [1994], Li and Jain [1998], Schütze [1998], Weigend et al. [1999], and Yang [1995].
6. INDUCTIVE CONSTRUCTION
OF TEXT CLASSIFIERS
The inductive construction of text clas-
sifiers has been tackled in a variety of
ways. Here we will deal only with the
methods that have been most popular
in TC, but we will also briefly mention
the existence of alternative, less standard
approaches.
We start by discussing the general
form that a text classifier has. Let us
recall from Section 2.4 that there are
two alternative ways of viewing classi-
fication: “hard” (fully automated) clas-
sification and ranking (semiautomated)
classification.
The inductive construction of a ranking classifier for category $c_i \in C$ usually consists in the definition of a function $CSV_i : D \rightarrow [0, 1]$ that, given a document $d_j$, returns a categorization status value for it, that is, a number between 0 and 1 which, roughly speaking, represents the evidence for the fact that $d_j \in c_i$. Documents are then ranked according to their $CSV_i$ value. This works for “document-ranking TC”; “category-ranking TC” is usually tackled by ranking, for a given document $d_j$, its $CSV_i$ scores for the different categories in $C = \{c_1, \ldots, c_{|C|}\}$.
The $CSV_i$ function takes up different meanings according to the learning method used: for instance, in the “Naïve Bayes” approach of Section 6.2 $CSV_i(d_j)$ is defined in terms of a probability, whereas in the “Rocchio” approach discussed in Section 6.7 $CSV_i(d_j)$ is a measure of vector closeness in $|T|$-dimensional space.
The construction of a “hard” classifier may follow two alternative paths. The former consists in the definition of a function $CSV_i : D \rightarrow \{T, F\}$. The latter consists instead in the definition of a function $CSV_i : D \rightarrow [0, 1]$, analogous to the one used for ranking classification, followed by the definition of a threshold $\tau_i$ such that $CSV_i(d_j) \geq \tau_i$ is interpreted
as T, while $CSV_i(d_j) < \tau_i$ is interpreted as F.¹¹

¹¹ Alternative methods are possible, such as training a classifier for which some standard, predefined value such as 0 is the threshold. For ease of exposition we will not discuss them.
The definition of thresholds will be the
topic of Section 6.1. In Sections 6.2 to 6.12
we will instead concentrate on the defini-
tion of $CSV_i$, discussing a number of ap-
proaches that have been applied in the TC
literature. In general we will assume we
are dealing with “hard” classification; it
will be evident from the context how and
whether the approaches can be adapted to
ranking classification. The presentation of
the algorithms will be mostly qualitative
rather than quantitative, that is, will fo-
cus on the methods for classifier learning
rather than on the effectiveness and ef-
ficiency of the classifiers built by means
of them; this will instead be the focus of
Section 7.
6.1. Determining Thresholds
There are various policies for determining the threshold $\tau_i$, also depending on the constraints imposed by the application. The most important distinction is whether the threshold is derived analytically or experimentally.

The former method is possible only in the presence of a theoretical result that indicates how to compute the threshold that maximizes the expected value of the effectiveness function [Lewis 1995a]. This is typical of classifiers that output probability estimates of the membership of $d_j$ in $c_i$ (see Section 6.2) and whose effectiveness is computed by decision-theoretic measures such as utility (see Section 7.1.3); we thus defer the discussion of this policy (which is called probability thresholding in Lewis [1995a]) to Section 7.1.3.
When such a theoretical result is not known, one has to revert to the latter method, which consists in testing different values for $\tau_i$ on a validation set and choosing the value which maximizes effectiveness. We call this policy CSV thresholding [Cohen and Singer 1999; Schapire et al. 1998; Wiener et al. 1995]; it is also called Scut in Yang [1999]. Different $\tau_i$'s are typically chosen for the different $c_i$'s.
A second, popular experimental policy is proportional thresholding [Iwayama and Tokunaga 1995; Larkey 1998; Lewis 1992a; Lewis and Ringuette 1994; Wiener et al. 1995], also called Pcut in Yang [1999]. This policy consists in choosing the value of $\tau_i$ for which $g_{Va}(c_i)$ is closest to $g_{Tr}(c_i)$, and embodies the principle that the same percentage of documents of both training and test set should be classified under $c_i$. For obvious reasons, this policy does not lend itself to document-pivoted TC.
Sometimes, depending on the application, a fixed thresholding policy (a.k.a. “k-per-doc” thresholding [Lewis 1992a] or Rcut [Yang 1999]) is applied, whereby it is stipulated that a fixed number $k$ of categories, equal for all $d_j$'s, are to be assigned to each document $d_j$. This is often used, for instance, in applications of TC to automated document indexing [Field 1975; Lam et al. 1999]. Strictly speaking, however, this is not a thresholding policy in the sense defined at the beginning of Section 6, as it might happen that $d'$ is classified under $c_i$, $d''$ is not, and $CSV_i(d') < CSV_i(d'')$. Quite clearly, this policy is mostly at home with document-pivoted TC. However, it suffers from a certain coarseness, as the fact that $k$ is equal for all documents (nor could this be otherwise) allows no fine-tuning.
In his experiments Lewis [1992a] found
the proportional policy to be superior to
probability thresholding when microaver-
aged effectiveness was tested but slightly
inferior with macroaveraging (see Section
7.1.1). Yang [1999] found instead CSV
thresholding to be superior to proportional
thresholding (possibly due to her category-
specific optimization on a validation set),
and found fixed thresholding to be con-
sistently inferior to the other two poli-
cies. The fact that these latter results have
been obtained across different classifiers
no doubt reinforces them.
In general, aside from the considera-
tions above, the choice of the thresholding
ACM Computing Surveys, Vol. 34, No. 1, March 2002.
20 Sebastiani
policy may also be influenced by the
application; for instance, in applying a
text classifier to document indexing for
Boolean systems a fixed thresholding pol-
icy might be chosen, while a proportional
or CSV thresholding method might be cho-
sen for Web page classification under hier-
archical catalogues.
6.2. Probabilistic Classifiers
Probabilistic classifiers (see Lewis [1998] for a thorough discussion) view $CSV_i(d_j)$ in terms of $P(c_i \mid \vec{d}_j)$, that is, the probability that a document represented by a vector $\vec{d}_j = \langle w_{1j}, \ldots, w_{|T|j} \rangle$ of (binary or weighted) terms belongs to $c_i$, and compute this probability by an application of Bayes' theorem, given by

$P(c_i \mid \vec{d}_j) = \frac{P(c_i)\, P(\vec{d}_j \mid c_i)}{P(\vec{d}_j)}.$   (3)

In (3) the event space is the space of documents: $P(\vec{d}_j)$ is thus the probability that a randomly picked document has vector $\vec{d}_j$ as its representation, and $P(c_i)$ the probability that a randomly picked document belongs to $c_i$.
The estimation of $P(\vec{d}_j \mid c_i)$ in (3) is problematic, since the number of possible vectors $\vec{d}_j$ is too high (the same holds for $P(\vec{d}_j)$, but for reasons that will be clear shortly this will not concern us). In order to alleviate this problem it is common to make the assumption that any two coordinates of the document vector are, when viewed as random variables, statistically independent of each other; this independence assumption is encoded by the equation

$P(\vec{d}_j \mid c_i) = \prod_{k=1}^{|T|} P(w_{kj} \mid c_i).$   (4)
Probabilistic classifiers that use this assumption are called Naïve Bayes classifiers, and account for most of the probabilistic approaches to TC in the literature (see Joachims [1998]; Koller and Sahami [1997]; Larkey and Croft [1996]; Lewis [1992a]; Lewis and Gale [1994]; Li and Jain [1998]; Robertson and Harding [1984]). The “naïve” character of the classifier is due to the fact that usually this assumption is, quite obviously, not verified in practice.
One of the best-known Naïve Bayes approaches is the binary independence classifier [Robertson and Sparck Jones 1976], which results from using binary-valued vector representations for documents. In this case, if we write $p_{ki}$ as short for $P(w_{kx} = 1 \mid c_i)$, the $P(w_{kj} \mid c_i)$ factors of (4) may be written as

$P(w_{kj} \mid c_i) = p_{ki}^{w_{kj}} (1 - p_{ki})^{1 - w_{kj}} = \left( \frac{p_{ki}}{1 - p_{ki}} \right)^{w_{kj}} (1 - p_{ki}).$   (5)
We may further observe that in TC the document space is partitioned into two categories,¹² $c_i$ and its complement $\bar{c}_i$, such that $P(\bar{c}_i \mid \vec{d}_j) = 1 - P(c_i \mid \vec{d}_j)$. If we plug (4) and (5) into (3) and take logs we obtain

$\log P(c_i \mid \vec{d}_j) = \log P(c_i) + \sum_{k=1}^{|T|} w_{kj} \log \frac{p_{ki}}{1 - p_{ki}} + \sum_{k=1}^{|T|} \log(1 - p_{ki}) - \log P(\vec{d}_j)$   (6)

$\log(1 - P(c_i \mid \vec{d}_j)) = \log(1 - P(c_i)) + \sum_{k=1}^{|T|} w_{kj} \log \frac{p_{k\bar{i}}}{1 - p_{k\bar{i}}} + \sum_{k=1}^{|T|} \log(1 - p_{k\bar{i}}) - \log P(\vec{d}_j),$   (7)
where we write $p_{k\bar{i}}$ as short for $P(w_{kx} = 1 \mid \bar{c}_i)$. We may convert (6) and (7) into a single equation by subtracting componentwise (7) from (6), thus obtaining

$\log \frac{P(c_i \mid \vec{d}_j)}{1 - P(c_i \mid \vec{d}_j)} = \log \frac{P(c_i)}{1 - P(c_i)} + \sum_{k=1}^{|T|} w_{kj} \log \frac{p_{ki} (1 - p_{k\bar{i}})}{p_{k\bar{i}} (1 - p_{ki})} + \sum_{k=1}^{|T|} \log \frac{1 - p_{ki}}{1 - p_{k\bar{i}}}.$   (8)

¹² Cooper [1995] has pointed out that in this case the full independence assumption of (4) is not actually made in the Naïve Bayes classifier; the assumption needed here is instead the weaker linked dependence assumption, which may be written as $\frac{P(\vec{d}_j \mid c_i)}{P(\vec{d}_j \mid \bar{c}_i)} = \prod_{k=1}^{|T|} \frac{P(w_{kj} \mid c_i)}{P(w_{kj} \mid \bar{c}_i)}$.
Note that $\frac{P(c_i \mid \vec{d}_j)}{1 - P(c_i \mid \vec{d}_j)}$ is a monotonically increasing function of $P(c_i \mid \vec{d}_j)$, and may thus be used directly as $CSV_i(d_j)$. Note also that $\log \frac{P(c_i)}{1 - P(c_i)}$ and $\sum_{k=1}^{|T|} \log \frac{1 - p_{ki}}{1 - p_{k\bar{i}}}$ are constant for all documents, and may thus be disregarded.¹³ Defining a classifier for category $c_i$ thus basically requires estimating the $2|T|$ parameters $\{p_{1i}, p_{1\bar{i}}, \ldots, p_{|T|i}, p_{|T|\bar{i}}\}$ from the training data, which may be done in the obvious way. Note that in general the classification of a given document does not require one to compute a sum of $|T|$ factors, as the presence of $\sum_{k=1}^{|T|} w_{kj} \log \frac{p_{ki} (1 - p_{k\bar{i}})}{p_{k\bar{i}} (1 - p_{ki})}$ would imply; in fact, all the factors for which $w_{kj} = 0$ may be disregarded, and this accounts for the vast majority of them, since document vectors are usually very sparse.

¹³ This is not true, however, if the “fixed thresholding” method of Section 6.1 is adopted. In fact, for a fixed document $d_j$ the first and third factor in the formula above are different for different categories, and may therefore influence the choice of the categories under which to file $d_j$.
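Putting the pieces together, the following is a minimal sketch (ours, not code from the literature cited) of the binary independence classifier: the $p_{ki}$ and $p_{k\bar{i}}$ parameters are estimated by counting occurrences in the training set (the add-one smoothing is our assumption), and the CSV of a document is the sum of log odds-ratio weights of the terms occurring in it.

```python
import math

def train_binary_nb(docs, labels):
    """Estimate the log p_ki(1-p_ki~)/(p_ki~(1-p_ki)) weights for category c_i.

    docs: list of documents, each a set of terms (binary representation).
    labels: True for positive examples of c_i, False otherwise.
    """
    vocab = set().union(*docs)
    pos = [d for d, y in zip(docs, labels) if y]
    neg = [d for d, y in zip(docs, labels) if not y]
    w = {}
    for t in vocab:
        # Add-one (Laplace) smoothing, an assumption of this sketch.
        p_pos = (1 + sum(t in d for d in pos)) / (2 + len(pos))
        p_neg = (1 + sum(t in d for d in neg)) / (2 + len(neg))
        w[t] = math.log(p_pos * (1 - p_neg) / (p_neg * (1 - p_pos)))
    return w

def csv(weights, doc):
    # Only terms with w_kj = 1 contribute: document vectors are sparse.
    return sum(weights.get(t, 0.0) for t in doc)

docs = [{"wheat", "farm"}, {"wheat", "export"}, {"stocks", "bank"}]
w = train_binary_nb(docs, [True, True, False])
print(csv(w, {"wheat", "price"}) > csv(w, {"bank", "price"}))  # True
```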
The method we have illustrated is just one of the many variants of the Naïve Bayes approach, the common denominator of which is (4). A recent paper by Lewis [1998] is an excellent roadmap on the various directions that research on Naïve Bayes classifiers has taken; among these are the ones aiming
—to relax the constraint that document
vectors should be binary-valued. This
looks natural, given that weighted in-
dexing techniques (see Fuhr [1989];
Salton and Buckley [1988]) accounting
for the “importance” of $t_k$ for $d_j$ play a
key role in IR.
—to introduce document length normalization. The value of $\log \frac{P(c_i \mid \vec{d}_j)}{1 - P(c_i \mid \vec{d}_j)}$ tends to be more extreme (i.e., very high or very low) for long documents (i.e., documents such that $w_{kj} = 1$ for many values of $k$), irrespective of their semantic relatedness to $c_i$, thus calling for length normalization. Taking length into account is easy in nonprobabilistic approaches to classification (see Section 6.7), but is problematic in probabilistic ones (see Lewis [1998], Section 5). One possible answer is to switch from an interpretation of Naïve Bayes in which documents are events to one in which terms are events [Baker and McCallum 1998; McCallum et al. 1998; Chakrabarti et al. 1998a; Guthrie et al. 1994]. This accounts for document length naturally but, as noted by Lewis [1998], has the drawback that different occurrences of the same word within the same document are viewed as independent, an assumption even more implausible than (4).
—to relax the independence assumption.
This may be the hardest route to follow,
since this produces classifiers of higher
computational cost and characterized
by harder parameter estimation prob-
lems [Koller and Sahami 1997]. Earlier
efforts in this direction within proba-
bilistic text search (e.g., van Rijsbergen
[1977]) have not shown the perfor-
mance improvements that were hoped
for. Recently, the fact that the binary in-
dependence assumption seldom harms
effectiveness has also been given some
theoretical justification [Domingos and
Pazzani 1997].
The mention of text search in the last
paragraph is not casual. Unlike other
types of classifiers, the literature on prob-
abilistic classifiers is inextricably inter-
twined with that on probabilistic search
systems (see Crestani et al. [1998] for a
review), since these latter attempt to de-
termine the probability that a document
falls in the category denoted by the query,
and since they are the only search systems
that take relevance feedback, a notion es-
sentially involving supervised learning, as
central.
6.3. Decision Tree Classifiers
Probabilistic methods are quantitative
(i.e., numeric) in nature, and as such
have sometimes been criticized since, ef-
fective as they may be, they are not eas-
ily interpretable by humans. A class of
algorithms that do not suffer from this
problem are symbolic (i.e., nonnumeric)
algorithms, among which inductive rule
learners (which we will discuss in Sec-
tion 6.4) and decision tree learners are the
most important examples.
A decision tree (DT) text classifier (see Mitchell [1996], Chapter 3) is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leaves are labeled by categories. Such a classifier categorizes a test document $d_j$ by recursively testing for the weights that the terms labeling the internal nodes have in vector $\vec{d}_j$, until a leaf node is reached; the label of this node is then assigned to $d_j$. Most such classifiers use binary document representations, and thus consist of binary trees. An example DT is illustrated in Figure 2.

[Fig. 2. A decision tree equivalent to the DNF rule of Figure 1. Edges are labeled by terms and leaves are labeled by categories (underlining denotes negation).]
There are a number of standard pack-
ages for DT learning, and most DT ap-
proaches to TC have made use of one such
package. Among the most popular ones are
ID3 (used by Fuhr et al. [1991]), C4.5 (used
by Cohen and Hirsh [1998], Cohen and
Singer [1999], Joachims [1998], and Lewis
and Catlett [1994]), and C5 (used by Li
and Jain [1998]). TC efforts based on ex-
perimental DT packages include Dumais
et al. [1998], Lewis and Ringuette [1994],
and Weiss et al. [1999].
A possible method for learning a DT for category $c_i$ consists in a “divide and conquer” strategy of (i) checking whether all the training examples have the same label (either $c_i$ or $\bar{c}_i$); (ii) if not, selecting a term $t_k$, partitioning $Tr$ into classes of documents that have the same value for $t_k$, and placing each such class in a separate subtree. The process is recursively repeated on the subtrees until each
leaf of the tree so generated contains training examples assigned to the same category $c_i$, which is then chosen as the label for the leaf. The key step is the choice of the term $t_k$ on which to operate the partition, a choice which is generally made according to an information gain or entropy criterion. However, such a “fully grown” tree may be prone to overfitting, as some branches may be too specific to the training data. Most DT learning methods thus include a method for growing the tree and one for pruning it, that is, for removing the overly specific branches. Variations on this basic schema for DT learning abound [Mitchell 1996, Section 3].
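A toy sketch (ours) of this divide-and-conquer strategy for binary document representations; the partitioning term is chosen by information gain, and pruning is omitted for brevity.

```python
import math

def entropy(labels):
    p = sum(labels) / len(labels)
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def grow_tree(docs, labels, terms):
    """docs: list of sets of terms; labels: True iff the doc belongs to c_i."""
    if len(set(labels)) == 1:                  # pure node: make it a leaf
        return labels[0]
    if not terms:                              # no term left: majority label
        return max(set(labels), key=labels.count)
    # Choose the term whose partition has the highest information gain.
    def gain(t):
        split = [[y for d, y in zip(docs, labels) if (t in d) == v]
                 for v in (True, False)]
        rest = sum(len(s) / len(labels) * entropy(s) for s in split if s)
        return entropy(labels) - rest
    t = max(terms, key=gain)
    branches = {}
    for v in (True, False):
        sub = [(d, y) for d, y in zip(docs, labels) if (t in d) == v]
        if not sub:
            branches[v] = max(set(labels), key=labels.count)
        else:
            sd, sl = zip(*sub)
            branches[v] = grow_tree(list(sd), list(sl), terms - {t})
    return (t, branches)

def classify(tree, doc):
    while isinstance(tree, tuple):             # descend until a leaf label
        t, branches = tree
        tree = branches[t in doc]
    return tree

docs = [{"wheat"}, {"wheat", "farm"}, {"bank"}, {"stocks"}]
tree = grow_tree(docs, [True, True, False, False],
                 {"wheat", "farm", "bank", "stocks"})
print(classify(tree, {"wheat", "price"}))      # True
```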
DT text classifiers have been used either
as the main classification tool [Fuhr et al.
1991; Lewis and Catlett 1994; Lewis and
Ringuette 1994], or as baseline classifiers
[Cohen and Singer 1999; Joachims 1998],
or as members of classifier committees [Li
and Jain 1998; Schapire and Singer 2000;
Weiss et al. 1999].
6.4. Decision Rule Classifiers
A classifier for category $c_i$ built by an inductive rule learning method consists of a DNF rule, that is, of a conditional rule with a premise in disjunctive normal form (DNF), of the type illustrated in Figure 1.¹⁴ The literals (i.e., possibly negated keywords) in the premise denote the presence (nonnegated keyword) or absence (negated keyword) of the keyword in the test document $d_j$, while the clause head denotes the decision to classify $d_j$ under $c_i$. DNF rules are similar to DTs in that they can encode any Boolean function. However, an advantage of DNF rule learners is that they tend to generate more compact classifiers than DT learners.
Rule learning methods usually attempt to select from all the possible covering rules (i.e., rules that correctly classify all the training examples) the “best” one according to some minimality criterion. While DTs are typically built by a top-down, “divide-and-conquer” strategy, DNF rules are often built in a bottom-up fashion. Initially, every training example $d_j$ is viewed as a clause $\eta_1, \ldots, \eta_n \rightarrow \gamma_i$, where $\eta_1, \ldots, \eta_n$ are the terms contained in $d_j$ and $\gamma_i$ equals $c_i$ or $\bar{c}_i$ according to whether $d_j$ is a positive or negative example of $c_i$. This set of clauses is already a DNF classifier for $c_i$, but obviously scores high in terms of overfitting. The learner then applies a process of generalization in which the rule is simplified through a series of modifications (e.g., removing premises from clauses, or merging clauses) that maximize its compactness while at the same time not affecting the “covering” property of the classifier. At the end of this process, a “pruning” phase similar in spirit to that employed in DTs is applied, in which the ability to correctly classify all the training examples is traded for more generality.

¹⁴ Many inductive rule learning algorithms build decision lists (i.e., arbitrarily nested if-then-else clauses) instead of DNF rules; since the former may always be rewritten as the latter, we will disregard the issue.
DNF rule learners vary widely in terms
of the methods, heuristics and criteria
employed for generalization and prun-
ing. Among the DNF rule learners that
have been applied to TC are CHARADE
[Moulinier and Ganascia 1996], DL-ESC
[Li and Yamanishi 1999], RIPPER [Cohen
1995a; Cohen and Hirsh 1998; Cohen and
Singer 1999], SCAR [Moulinier et al. 1996],
and SWAP-1 [Apté 1994].
While the methods above use rules
of propositional logic (PL), research has
also been carried out using rules of first-
order logic (FOL), obtainable through
the use of inductive logic programming
methods. Cohen [1995a] has extensively
compared PL and FOL learning in TC
(for instance, comparing the PL learner
RIPPER with its FOL version FLIPPER), and
has found that the additional represen-
tational power of FOL brings about only
modest benefits.
6.5. Regression Methods
Various TC efforts have used regression
models (see Fuhr and Pfeifer [1994]; Ittner
et al. [1995]; Lewis and Gale [1994];
Schütze et al. [1995]). Regression denotes the approximation of a real-valued (instead of binary, as in the case of classification) function $\breve{\Phi}$ by means of a function that fits the training data [Mitchell 1996, page 236]. Here we will describe one such model, the Linear Least-Squares Fit (LLSF) applied to TC by Yang and Chute [1994]. In LLSF, each document $d_j$ has two vectors associated to it: an input vector $I(d_j)$ of $|T|$ weighted terms, and an output vector $O(d_j)$ of $|C|$ weights representing the categories (the weights for this latter vector are binary for training documents, and are nonbinary CSVs for test documents). Classification may thus be seen as the task of determining an output vector $O(d_j)$ for test document $d_j$, given its input vector $I(d_j)$; hence, building a classifier boils down to computing a $|C| \times |T|$ matrix $\hat{M}$ such that $\hat{M}\, I(d_j) = O(d_j)$.
LLSF computes the matrix from the training data by computing a linear least-squares fit that minimizes the error on the training set according to the formula $\hat{M} = \arg\min_M \|MI - O\|_F$, where $\arg\min_M(x)$ stands as usual for the $M$ for which $x$ is minimum, $\|V\|_F \stackrel{\mathrm{def}}{=} \sqrt{\sum_{i=1}^{|C|} \sum_{j=1}^{|T|} v_{ij}^2}$ represents the so-called Frobenius norm of a $|C| \times |T|$ matrix, $I$ is the $|T| \times |Tr|$ matrix whose columns are the input vectors of the training documents, and $O$ is the $|C| \times |Tr|$ matrix whose columns are the output vectors of the training documents. The $\hat{M}$ matrix is usually computed by performing a singular value decomposition on the training set, and its generic entry $\hat{m}_{ik}$ represents the degree of association between category $c_i$ and term $t_k$.
The experiments of Yang and Chute
[1994] and Yang and Liu [1999] indicate
that LLSF is one of the most effective text
classifiers known to date. One of its disad-
vantages, though, is that the cost of com-
puting the $\hat{M}$ matrix is much higher than
that of many other competitors in the TC
arena.
6.6. On-Line Methods
A linear classifier for category $c_i$ is a vector $\vec{c}_i = \langle w_{1i}, \ldots, w_{|T|i} \rangle$ belonging to the same $|T|$-dimensional space in which documents are also represented, and such that $CSV_i(d_j)$ corresponds to the dot product $\sum_{k=1}^{|T|} w_{ki} w_{kj}$ of $\vec{d}_j$ and $\vec{c}_i$. Note that when both classifier and document weights are cosine-normalized (see (2)), the dot product between the two vectors corresponds to their cosine similarity, that is:

$S(c_i, d_j) = \cos(\alpha) = \frac{\sum_{k=1}^{|T|} w_{ki} \cdot w_{kj}}{\sqrt{\sum_{k=1}^{|T|} w_{ki}^2} \cdot \sqrt{\sum_{k=1}^{|T|} w_{kj}^2}},$

which represents the cosine of the angle $\alpha$ that separates the two vectors. This is the similarity measure between query and document computed by standard vector-space IR engines, which means in turn that, once a linear classifier has been built, classification can be performed by invoking such an engine. Practically all search engines have a dot product flavor to them, and can therefore be adapted to doing TC with a linear classifier.
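For instance (a sketch of ours), with cosine-normalized vectors the CSV of a test document reduces to one dot product:

```python
import numpy as np

def cosine_csv(c_i, d_j):
    """CSV_i(d_j) as the cosine similarity of classifier and document vectors."""
    c = c_i / np.linalg.norm(c_i)
    d = d_j / np.linalg.norm(d_j)
    return float(c @ d)          # dot product of cosine-normalized vectors

c_i = np.array([0.9, 0.1, 0.0, 0.4])   # one weight per term in T
d_j = np.array([1.0, 0.0, 0.0, 1.0])
print(cosine_csv(c_i, d_j))            # ~0.93
```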
Methods for learning linear classifiers
are often partitioned in two broad classes,
batch methods and on-line methods.
Batch methods build a classifier by ana-
lyzing the training set all at once. Within
the TC literature, one example of a batch
method is linear discriminant analysis,
a model of the stochastic dependence be-
tween terms that relies on the covari-
ance matrices of the categories [Hull 1994;
Schütze et al. 1995]. However, the fore-
most example of a batch method is the
Rocchio method; because of its importance
in the TC literature, this will be discussed
separately in Section 6.7. In this section
we will instead concentrate on on-line
methods.
On-line (a.k.a. incremental) methods
build a classifier soon after examining
the first training document, and incre-
mentally refine it as they examine new
ones. This may be an advantage in the
applications in which Tr is not avail-
able in its entirety from the start, or in
which the “meaning” of the category may
change in time, as for example, in adaptive
filtering. This also suits applications
(e.g., semiautomated classification, adap-
tive filtering) in which we may expect the
user of a classifier to provide feedback on
how test documents have been classified,
as in this case further training may be per-
formed during the operating phase by ex-
ploiting user feedback.
A simple on-line method is the perceptron algorithm, first applied to TC by Schütze et al. [1995] and Wiener et al. [1995], and subsequently used by Dagan et al. [1997] and Ng et al. [1997]. In this algorithm, the classifier for $c_i$ is first initialized by setting all weights $w_{ki}$ to the same positive value. When a training example $d_j$ (represented by a vector $\vec{d}_j$ of binary weights) is examined, the classifier built so far classifies it. If the result of the classification is correct, nothing is done, while if it is wrong, the weights of the classifier are modified: if $d_j$ was a positive example of $c_i$, then the weights $w_{ki}$ of “active terms” (i.e., the terms $t_k$ such that $w_{kj} = 1$) are “promoted” by increasing them by a fixed quantity $\alpha > 0$ (called the learning rate), while if $d_j$ was a negative example of $c_i$ then the same weights are “demoted” by decreasing them by $\alpha$. Note that when the classifier has reached a reasonable level of effectiveness, the fact that a weight $w_{ki}$ is very low means that $t_k$ has negatively contributed to the classification process so far, and may thus be discarded from the representation. We may then see the perceptron algorithm (as all other incremental learning methods) as allowing for a sort of “on-the-fly term space reduction” [Dagan et al. 1997, Section 4.4]. The perceptron classifier has shown good effectiveness in all the experiments quoted above.
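A sketch (ours) of the perceptron just described, for binary document vectors; the fixed decision threshold is our assumption, as thresholding proper is the topic of Section 6.1.

```python
def train_perceptron(docs, labels, alpha=0.1, epochs=5, vocab=None):
    """docs: list of sets of active terms; labels: True iff d_j is in c_i."""
    vocab = vocab or set().union(*docs)
    w = {t: 1.0 for t in vocab}   # all weights start at the same positive value
    theta = len(vocab) * 0.5      # fixed decision threshold (our assumption)
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            predicted = sum(w[t] for t in doc) >= theta
            if predicted == y:
                continue                 # correct: nothing is done
            for t in doc:                # wrong: promote or demote active terms
                w[t] += alpha if y else -alpha
    return w, theta

docs = [{"wheat", "farm"}, {"wheat", "export"}, {"bank", "stocks"}]
w, theta = train_perceptron(docs, [True, True, False])
print(sum(w[t] for t in {"wheat", "farm"}) >= theta)   # True: filed under c_i
```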
The perceptron is an additive weight-updating algorithm. A multiplicative variant of it is POSITIVE WINNOW [Dagan et al. 1997], which differs from the perceptron because two different constants $\alpha_1 > 1$ and $0 < \alpha_2 < 1$ are used for promoting and demoting weights, respectively, and because promotion and demotion are achieved by multiplying, instead of adding, by $\alpha_1$ and $\alpha_2$. BALANCED WINNOW [Dagan et al. 1997] is a further variant of POSITIVE WINNOW, in which the classifier consists of two weights $w^+_{ki}$ and $w^-_{ki}$ for each term $t_k$; the final weight $w_{ki}$ used in computing the dot product is the difference $w^+_{ki} - w^-_{ki}$. Following the misclassification of a positive instance, active terms have their $w^+_{ki}$ weight promoted and their $w^-_{ki}$ weight demoted, whereas in the case of a negative instance it is $w^+_{ki}$ that gets demoted while $w^-_{ki}$ gets promoted (for the rest, promotions and demotions are as in POSITIVE WINNOW). BALANCED WINNOW allows negative $w_{ki}$ weights, while in the perceptron and in POSITIVE WINNOW the $w_{ki}$ weights are always positive. In experiments conducted by Dagan et al. [1997], POSITIVE WINNOW showed a better effectiveness than the perceptron but was in turn outperformed by (Dagan et al.'s own version of) BALANCED WINNOW.
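The multiplicative update at the heart of POSITIVE WINNOW can be sketched as follows (ours; the constants are illustrative):

```python
def winnow_update(w, doc, y, alpha1=2.0, alpha2=0.5):
    """One POSITIVE WINNOW update after a misclassified example.

    w: term -> positive weight; doc: set of active terms;
    y: True if the example was a positive instance of c_i.
    """
    for t in doc:                         # only active terms are touched
        w[t] *= alpha1 if y else alpha2   # promote or demote multiplicatively
    return w

w = {"wheat": 1.0, "bank": 1.0}
print(winnow_update(w, {"wheat"}, True))   # {'wheat': 2.0, 'bank': 1.0}
```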
Other on-line methods for building text
classifiers are WIDROW-HOFF, a refinement
of it called EXPONENTIATED GRADIENT (both
applied for the first time to TC in [Lewis
et al. 1996]) and SLEEPING EXPERTS [Cohen
and Singer 1999], a version of BALANCED
WINNOW. While the first is an additive
weight-updating algorithm, the second
and third are multiplicative. Key differ-
ences with the previously described al-
gorithms are that these three algorithms
(i) update the classifier not only after mis-
classifying a training example, but also af-
ter classifying it correctly, and (ii) update
the weights corresponding to all terms (in-
stead of just active ones).
Linear classifiers lend themselves to both category-pivoted and document-pivoted TC. For the former the classifier $\vec{c}_i$ is used, in a standard search engine, as a query against the set of test documents, while for the latter the vector $\vec{d}_j$ representing the test document is used as a query against the set of classifiers $\{\vec{c}_1, \ldots, \vec{c}_{|C|}\}$.
6.7. The Rocchio Method
Some linear classifiers consist of an ex-
plicit profile (or prototypical document)
of the category. This has obvious advan-
tages in terms of interpretability, as such
a profile is more readily understandable
by a human than, say, a neural network
classifier. Learning a linear classifier is often preceded by local TSR; in this case, a profile of $c_i$ is a weighted list of the terms whose presence or absence is most useful for discriminating $c_i$.
The Rocchio method is used for induc-
ing linear, profile-style classifiers. It re-
lies on an adaptation to TC of the well-
known Rocchio’s formula for relevance
feedback in the vector-space model, and
it is perhaps the only TC method rooted
in the IR tradition rather than in the
ML one. This adaptation was first pro-
posed by Hull [1994], and has been used
by many authors since then, either as
an object of research in its own right
[Ittner et al. 1995; Joachims 1997; Sable
and Hatzivassiloglou 2000; Schapire et al.
1998; Singhal et al. 1997], or as a base-
line classifier [Cohen and Singer 1999;
Galavotti et al. 2000; Joachims 1998;
Lewis et al. 1996; Schapire and Singer
2000; Schütze et al. 1995], or as a mem-
ber of a classifier committee [Larkey and
Croft 1996] (see Section 6.11).
Rocchio's method computes a classifier $\vec{c}_i = \langle w_{1i}, \ldots, w_{|T|i} \rangle$ for category $c_i$ by means of the formula

$w_{ki} = \beta \cdot \sum_{\{d_j \in POS_i\}} \frac{w_{kj}}{|POS_i|} - \gamma \cdot \sum_{\{d_j \in NEG_i\}} \frac{w_{kj}}{|NEG_i|},$

where $w_{kj}$ is the weight of $t_k$ in document $d_j$, $POS_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = T\}$, and $NEG_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = F\}$. In this formula, $\beta$ and $\gamma$ are control parameters that allow setting the relative importance of positive and negative examples. For instance, if $\beta$ is set to 1 and $\gamma$ to 0 (as in Dumais et al. [1998]; Hull [1994]; Joachims [1998]; Schütze et al. [1995]), the profile of $c_i$ is the centroid of its positive training examples. A classifier built by means of the Rocchio method rewards the closeness of a test document to the centroid of the positive training examples, and its distance from the centroid of the negative training examples. The role of negative examples is usually deemphasized, by setting $\beta$ to a high value and $\gamma$ to a low one (e.g., Cohen and Singer [1999], Ittner et al. [1995], and Joachims [1997] use $\beta = 16$ and $\gamma = 4$).
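A sketch of the Rocchio computation (ours), over a matrix of term weights:

```python
import numpy as np

def rocchio_profile(W, positive, beta=16.0, gamma=4.0):
    """Compute the Rocchio classifier for one category.

    W: (|Tr| x |T|) matrix of term weights w_kj.
    positive: boolean array; True where the training document is in POS_i.
    """
    pos_centroid = W[positive].mean(axis=0)    # sum of w_kj over |POS_i|
    neg_centroid = W[~positive].mean(axis=0)   # sum of w_kj over |NEG_i|
    return beta * pos_centroid - gamma * neg_centroid

W = np.array([[0.9, 0.0], [0.7, 0.2], [0.1, 0.8]])
print(rocchio_profile(W, np.array([True, True, False])))   # [12.4 -1.6]
```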
This method is quite easy to implement,
and is also quite efficient, since learning
a classifier basically comes down to aver-
aging weights. In terms of effectiveness,
instead, a drawback is that if the docu-
ments in the category tend to occur in
disjoint clusters (e.g., a set of newspaper
articles labeled with the Sports category
and dealing with either boxing or rock-
climbing), such a classifier may miss most
of them, as the centroid of these docu-
ments may fall outside all of these clusters
(see Figure 3(a)). More generally, a classi-
fier built by the Rocchio method, as all lin-
ear classifiers, has the disadvantage that
it divides the space of documents linearly.
This situation is graphically depicted in
Figure 3(a), where documents are classi-
fied within $c_i$ if and only if they fall within
the circle. Note that even most of the pos-
itive training examples would not be clas-
sified correctly by the classifier.
6.7.1. Enhancements to the Basic Rocchio Framework. One issue in the application of the Rocchio formula to profile extraction is whether the set $NEG_i$ should be considered in its entirety, or whether a well-chosen sample of it, such as the set $NPOS_i$ of near-positives (defined as “the most positive among the negative training examples”), should be selected from it, yielding

$w_{ki} = \beta \cdot \sum_{\{d_j \in POS_i\}} \frac{w_{kj}}{|POS_i|} - \gamma \cdot \sum_{\{d_j \in NPOS_i\}} \frac{w_{kj}}{|NPOS_i|}.$

The $\sum_{\{d_j \in NPOS_i\}} \frac{w_{kj}}{|NPOS_i|}$ factor is more significant than $\sum_{\{d_j \in NEG_i\}} \frac{w_{kj}}{|NEG_i|}$, since near-positives are the most difficult documents to tell apart from the positives. Using near-positives corresponds to the query zoning method proposed for IR by Singhal et al. [1997]. This method originates from the observation that, when the original
Rocchio formula is used for relevance feedback in IR, near-positives tend to be used rather than generic negatives, as the documents on which user judgments are available are among the ones that had scored highest in the previous ranking.

[Fig. 3. A comparison between the TC behavior of (a) the Rocchio classifier, and (b) the k-NN classifier. Small crosses and circles denote positive and negative training instances, respectively. The big circles denote the “influence area” of the classifier. Note that, for ease of illustration, document similarities are here viewed in terms of Euclidean distance rather than, as is more common, in terms of dot product or cosine.]

Early applications of the Rocchio for-
mula to TC (e.g., Hull [1994]; Ittner et al.
[1995]) generally did not make a distinc-
tion between near-positives and generic
negatives. In order to select the near-
positives Schapire et al. [1998] issue a
query, consisting of the centroid of the pos-
itive training examples, against a docu-
ment base consisting of the negative train-
ing examples; the top-ranked ones are the
most similar to this centroid, and are then
the near-positives. Wiener et al. [1995] in-
stead equate the near-positives of $c_i$ to the positive examples of the sibling categories of $c_i$, as in the application they work on (TC with hierarchically organized category sets) the notion of a “sibling category of $c_i$” is well defined. A similar policy
is also adopted by Ng et al. [1997], Ruiz
and Srinivasan [1999], and Weigend et al.
[1999].
By using query zoning plus other en-
hancements (TSR, statistical phrases, and
a method called dynamic feedback op-
timization), Schapire et al. [1998] have
found that a Rocchio classifier can achieve
an effectiveness comparable to that of
a state-of-the-art ML method such as
“boosting” (see Section 6.11.1) while being
60 times quicker to train. These recent
results will no doubt bring about a re-
newed interest for the Rocchio classifier,
previously considered an underperformer
[Cohen and Singer 1999; Joachims 1998;
Lewis et al. 1996; Sch¨ utze et al. 1995; Yang
1999].
6.8. Neural Networks
A neural network (NN) text classifier is a
network of units, where the input units
represent terms, the output unit(s) repre-
sent the category or categories of interest,
and the weights on the edges connecting
units represent dependence relations. For
classifying a test document d
j
, its term
weights w
kj
are loadedinto the input units;
the activation of these units is propa-
gated forward through the network, and
the value of the output unit(s) determines
the categorization decision(s). A typical
way of training NNs is backpropagation,
whereby the term weights of a training
document are loaded into the input units,
and if a misclassification occurs the error
is “backpropagated” so as to change the pa-
rameters of the network and eliminate or
minimize the error.
The simplest type of NN classifier is
the perceptron [Dagan et al. 1997; Ng
et al. 1997], which is a linear classifier and
as such has been extensively discussed
in Section 6.6. Other types of linear NN
classifiers implementing a form of logis-
tic regression have also been proposed
and tested by Schütze et al. [1995] and
Wiener et al. [1995], yielding very good
effectiveness.
A nonlinear NN [Lam and Lee 1999;
Ruiz and Srinivasan 1999; Schütze et al.
1995; Weigend et al. 1999; Wiener et al.
1995; Yang and Liu 1999] is instead a net-
work with one or more additional “layers”
of units, which in TC usually represent
higher-order interactions between terms
that the network is able to learn. When
comparative experiments relating nonlin-
ear NNs to their linear counterparts have
been performed, the former have yielded
either no improvement [Schütze et al.
1995] or very small improvements [Wiener
et al. 1995] over the latter.
6.9. Example-Based Classifiers
Example-based classifiers do not build an
explicit, declarative representation of the
category $c_i$, but rely on the category la-
bels attached to the training documents
similar to the test document. These meth-
ods have thus been called lazy learners,
since “they defer the decision on how to
generalize beyond the training data until
each new query instance is encountered”
[Mitchell 1996, page 244].
The first application of example-based
methods (a.k.a. memory-based reason-
ing methods) to TC is due to Creecy,
Masand and colleagues [Creecy et al.
1992; Masand et al. 1992]; other examples
include Joachims [1998], Lam et al. [1999],
Larkey [1998], Larkey [1999], Li and Jain
[1998], Yang and Pedersen [1997], and
Yang and Liu [1999]. Our presentation of
the example-based approach will be based
on the k-NN (for “k nearest neighbors”)
algorithm used by Yang [1994]. For decid-
ing whether d
j
∈c
i
, k-NNlooks at whether
the k training documents most similar to
d
j
also are in c
i
; if the answer is posi-
tive for a large enough proportion of them,
a positive decision is taken, and a nega-
tive decision is taken otherwise. Actually,
Yang’s is a distance-weighted version of
k-NN (see [Mitchell 1996, Section 8.2.1]),
since the fact that a most similar docu-
ment is in c
i
is weighted by its similar-
ity with the test document. Classifying d
j
by means of k-NN thus comes down to
computing
CSV
i
(d
j
)
=
¸
d
z
∈ Tr
k
(d
j
)
RSV(d
j
, d
z
) · [[
˘
(d
z
, c
i
)]],
(9)
where Tr
k
(d
j
) is the set of the k documents
d
z
which maximize RSV(d
j
, d
z
) and
[[α]] =

1 if α = T
0 if α = F
.
The thresholding methods of Section 6.1 can then be used to convert the real-valued $CSV_i$'s into binary categorization decisions. In (9), $RSV(d_j, d_z)$ represents some measure of semantic relatedness between a test document $d_j$ and a training document $d_z$; any matching function, be it probabilistic (as used by Larkey and Croft [1996]) or vector-based (as used by Yang [1994]), from a ranked IR system may be used for this purpose. The construction of a k-NN classifier also involves determining (experimentally, on a validation set) a threshold $k$ that indicates how many top-ranked training documents have to be considered for computing $CSV_i(d_j)$. Larkey and Croft [1996] used $k = 20$, while Yang [1994, 1999] has found $30 \leq k \leq 45$ to yield the best effectiveness. Anyhow, various experiments have shown that increasing the value of $k$ does not significantly degrade the performance.
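A sketch of ours of the distance-weighted computation of (9), with cosine similarity standing in for $RSV$ and the training documents given as rows of a matrix:

```python
import numpy as np

def knn_csv(d_j, train_vecs, in_category, k=3):
    """CSV_i(d_j) per Equation (9): similarity-weighted votes of the
    k training documents most similar to d_j (cosine plays RSV here)."""
    X = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    q = d_j / np.linalg.norm(d_j)
    sims = X @ q                                  # RSV(d_j, d_z) for all d_z
    top_k = np.argsort(sims)[-k:]                 # Tr_k(d_j)
    return float(sum(sims[z] for z in top_k if in_category[z]))

train = np.array([[1.0, 0.0], [0.9, 0.3], [0.0, 1.0], [0.1, 0.9]])
in_cat = np.array([True, True, False, False])
print(knn_csv(np.array([1.0, 0.1]), train, in_cat, k=3))   # ~1.97
```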
Note that k-NN, unlike linear classi-
fiers, does not divide the document space
linearly, and hence does not suffer from
the problem discussed at the end of
Section 6.7. This is graphically depicted
in Figure 3(b), where the more “local”
character of k-NN with respect to Rocchio
can be appreciated.
This method is naturally geared toward
document-pivoted TC, since ranking the
training documents for their similarity
with the test document can be done once
for all categories. For category-pivoted TC,
one would need to store the document
ranks for each test document, which is ob-
viously clumsy; DPC is thus de facto the
only reasonable way to use k-NN.
A number of different experiments (see
Section 7.3) have shown k-NN to be quite
effective. However, its most important
drawback is its inefficiency at classifica-
tion time: while, for example, with a lin-
ear classifier only a dot product needs to
be computed to classify a test document,
k-NN requires the entire training set to
be ranked for similarity with the test docu-
ment, which is much more expensive. This
is a drawback of “lazy” learning methods,
since they do not have a true training
phase and thus defer all the computation
to classification time.
6.9.1. Other Example-Based Techniques.
Various example-based techniques have
been used in the TC literature. For exam-
ple, Cohen and Hirsh [1998] implemented
an example-based classifier by extending
standard relational DBMS technology
with “similarity-based soft joins.” In
their WHIRL system they used the scoring
function
$CSV_i(d_j) = 1 - \prod_{d_z \in Tr_k(d_j)} (1 - RSV(d_j, d_z))^{[\![\breve{\Phi}(d_z, c_i)]\!]}$
as an alternative to (9), obtaining a small
but statistically significant improvement
over a version of WHIRL using (9). In
their experiments this technique outper-
formed a number of other classifiers, such
as a C4.5 decision tree classifier and the
RIPPER CNF rule-based classifier.
A variant of the basic k-NN ap-
proach was proposed by Galavotti et al.
[2000], who reinterpreted (9) by redefining
$[\![\alpha]\!]$ as

$[\![\alpha]\!] = \begin{cases} 1 & \text{if } \alpha = T \\ -1 & \text{if } \alpha = F. \end{cases}$
The difference from the original k-NN approach is that if a training document $d_z$ similar to the test document $d_j$ does not belong to $c_i$, this information is not discarded but weighs negatively in the decision to classify $d_j$ under $c_i$.
A combination of profile- and example-based methods was presented in Lam and Ho [1998]. In this work a k-NN system was fed generalized instances (GIs) in place of training documents. This approach may be seen as the result of

—clustering the training set, thus obtaining a set of clusters $K_i = \{k_{i1}, \ldots, k_{i|K_i|}\}$;
—building a profile $G(k_{iz})$ (“generalized instance”) from the documents belonging to cluster $k_{iz}$ by means of some algorithm for learning linear classifiers (e.g., Rocchio, WIDROW-HOFF);
—applying k-NN with profiles in place of training documents, that is, computing

$CSV_i(d_j) \stackrel{\mathrm{def}}{=} \sum_{k_{iz} \in K_i} RSV(d_j, G(k_{iz})) \cdot \frac{|\{d_j \in k_{iz} \mid \breve{\Phi}(d_j, c_i) = T\}|}{|\{d_j \in k_{iz}\}|} \cdot \frac{|\{d_j \in k_{iz}\}|}{|Tr|}$
$= \sum_{k_{iz} \in K_i} RSV(d_j, G(k_{iz})) \cdot \frac{|\{d_j \in k_{iz} \mid \breve{\Phi}(d_j, c_i) = T\}|}{|Tr|},$   (10)

where $\frac{|\{d_j \in k_{iz} \mid \breve{\Phi}(d_j, c_i) = T\}|}{|\{d_j \in k_{iz}\}|}$ represents the “degree” to which $G(k_{iz})$ is a positive instance of $c_i$, and $\frac{|\{d_j \in k_{iz}\}|}{|Tr|}$ represents its weight within the entire process.
This exploits the superior effectiveness (see Figure 3) of k-NN over linear classifiers while at the same time avoiding the sensitivity of k-NN to the presence of “outliers” (i.e., positive instances of $c_i$ that “lie out” of the region where most other positive instances of $c_i$ are located) in the training set.
[Fig. 4. Learning support vector classifiers. The small crosses and circles represent positive and negative training examples, respectively, whereas lines represent decision surfaces. Decision surface $\sigma_i$ (indicated by the thicker line) is, among those shown, the best possible one, as it is the middle element of the widest set of parallel decision surfaces (i.e., its minimum distance to any training example is maximum). Small boxes indicate the support vectors.]
6.10. Building Classifiers by Support
Vector Machines
The support vector machine (SVM) method
has been introduced in TC by Joachims
[1998, 1999] and subsequently used by
Drucker et al. [1999], Dumais et al. [1998],
Dumais and Chen [2000], Klinkenberg
and Joachims [2000], Taira and Haruno
[1999], and Yang and Liu [1999]. In ge-
ometrical terms, it may be seen as the
attempt to find, among all the surfaces $\sigma_1, \sigma_2, \ldots$ in $|T|$-dimensional space that separate the positive from the negative training examples (decision surfaces), the $\sigma_i$ that separates the positives from the negatives by the widest possible margin, that is, such that the separation property is invariant with respect to the widest possible translation of $\sigma_i$.
This idea is best understood in the case in which the positives and the negatives are linearly separable, in which case the decision surfaces are $(|T|-1)$-hyperplanes. In the two-dimensional case of Figure 4, various lines may be chosen as decision surfaces. The SVM method chooses the middle element from the “widest” set of parallel lines, that is, from the set in which the maximum distance between two elements in the set is highest. It is noteworthy that this “best” decision surface is determined by only a small set of training examples, called the support vectors.
The method described is applicable also
to the case in which the positives and the
negatives are not linearly separable. Yang
and Liu [1999] experimentally compared
the linear case (namely, when the assump-
tion is made that the categories are lin-
early separable) with the nonlinear case
on a standard benchmark, and obtained
slightly better results in the former case.
As argued by Joachims [1998], SVMs
offer two important advantages for TC:
—term selection is often not needed, as
SVMs tend to be fairly robust to over-
fitting and can scale up to considerable
dimensionalities;
—no human and machine effort in parameter tuning on a validation set is needed,
as there is a theoretically motivated,
“default” choice of parameter settings,
which has also been shown to provide
the best effectiveness.
Dumais et al. [1998] have applied a
novel algorithm for training SVMs which
brings about training speeds comparable
to computationally easy learners such as
Rocchio.
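As an illustration only (Joachims' own experiments used his SVMlight package), the following sketch trains a soft-margin linear SVM on TF-IDF document vectors with scikit-learn; the toy texts and all names are hypothetical.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Hypothetical toy training data: (text, does it belong to c_i?).
    train_texts = ["grain prices rise", "wheat harvest up",
                   "new film released", "actor wins award"]
    train_labels = [True, True, False, False]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_texts)  # |T|-dimensional vectors

    # Fit the separating hyperplane (soft-margin in practice).
    classifier = LinearSVC(C=1.0).fit(X, train_labels)

    # The signed distance from the hyperplane can serve as CSV_i(d_j).
    print(classifier.decision_function(vectorizer.transform(["corn prices fall"])))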
6.11. Classifier Committees
Classifier committees (a.k.a. ensembles)
are based on the idea that, given a task
that requires expert knowledge to per-
form, k experts may be better than one if
their individual judgments are appropri-
ately combined. In TC, the idea is to ap-
ply $k$ different classifiers $\Phi_1, \ldots, \Phi_k$ to the same task of deciding whether $d_j \in c_i$, and then combine their outcome appropriately.
A classifier committee is then character-
ized by (i) a choice of k classifiers, and (ii)
a choice of a combination function.
Concerning Issue (i), it is known from
the ML literature that, in order to guar-
antee good effectiveness, the classifiers
forming the committee should be as in-
dependent as possible [Tumer and Ghosh
1996]. The classifiers may differ for the in-
dexing approach used, or for the inductive
method, or both. Within TC, the avenue
which has been explored most is the latter
(to our knowledge the only example of the
former is Scott and Matwin [1999]).
Concerning Issue (ii), various rules have been tested. The simplest one is majority voting (MV), whereby the binary outputs of the $k$ classifiers are pooled together, and the classification decision that reaches the majority of $\frac{k+1}{2}$ votes is taken ($k$ obviously needs to be an odd number) [Li and Jain 1998; Liere and Tadepalli 1997]. This method is particularly suited to the case in which the committee includes classifiers characterized by a binary decision function $\mathrm{CSV}_i : D \to \{T, F\}$. A second rule is weighted linear combination (WLC), whereby a weighted sum of the $\mathrm{CSV}_i$'s produced by the $k$ classifiers yields the final $\mathrm{CSV}_i$. The weights $w_j$ reflect the expected relative effectiveness of classifier $\Phi_j$, and are usually optimized on a validation set [Larkey and Croft 1996]. Another policy is dynamic classifier selection (DCS), whereby among committee $\{\Phi_1, \ldots, \Phi_k\}$ the classifier $\Phi_t$ most effective on the $l$ validation examples most similar to $d_j$ is selected, and its judgment adopted by the committee [Li and Jain 1998]. A still different policy, somehow intermediate between WLC and DCS, is adaptive classifier combination (ACC), whereby the judgments of all the classifiers in the committee are summed together, but their individual contribution is weighted by their effectiveness on the $l$ validation examples most similar to $d_j$ [Li and Jain 1998].
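The two simplest combination rules translate directly into code. A minimal sketch, with names of our own choosing (in practice the WLC weights would be optimized on a validation set):

    import numpy as np

    def majority_vote(decisions):
        # MV: decisions is a length-k sequence of True/False outputs
        # (k odd); the class that reaches (k+1)/2 votes wins.
        return sum(decisions) >= (len(decisions) + 1) / 2

    def weighted_linear_combination(csv_values, weights):
        # WLC: the final CSV_i is a weighted sum of the k classifiers'
        # CSV_i values.
        return float(np.dot(csv_values, weights))

    print(majority_vote([True, False, True]))                        # True
    print(weighted_linear_combination([0.9, 0.2, 0.6], [0.5, 0.2, 0.3]))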
Classifier committees have had mixed
results in TC so far. Larkey and Croft
[1996] have used combinations of Rocchio, Naïve Bayes, and k-NN, all together or in pairwise combinations, using a WLC rule. In their experiments the combination of any two classifiers outperformed the best individual classifier (k-NN), and the combination of the three classifiers improved on all three pairwise combinations. These
results would seem to give strong sup-
port to the idea that classifier committees
can somehow profit from the complemen-
tary strengths of their individual mem-
bers. However, the small size of the test set
used (187 documents) suggests that more
experimentation is needed before conclu-
sions can be drawn.
Li and Jain [1998] have tested a committee formed of (various combinations of) a Naïve Bayes classifier, an example-based classifier, a decision tree classifier, and a classifier built by means of their own “subspace method”; the combination rules they have worked with are MV, DCS, and ACC. Only in the case of a committee formed by Naïve Bayes and the subspace classifier combined by means of ACC has the
committee outperformed, and by a nar-
row margin, the best individual classifier
(for every attempted classifier combina-
tion ACC gave better results than MV and
DCS). This seems discouraging, especially
in light of the fact that the committee ap-
proach is computationally expensive (its
cost trivially amounts to the sum of the
costs of the individual classifiers plus
the cost incurred for the computation of
the combination rule). Again, it has to be
remarked that the small size of their ex-
periment (two test sets of less than 700
documents each were used) does not allow
us to draw definitive conclusions on the
approaches adopted.
6.11.1. Boosting. The boosting method
[Schapire et al. 1998; Schapire and Singer
2000] occupies a special place in the classi-
fier committees literature, since the $k$ classifiers $\Phi_1, \ldots, \Phi_k$ forming the committee are obtained by the same learning method (here called the weak learner). The key intuition of boosting is that the $k$ classifiers should be trained not in a conceptually parallel and independent way, as in the committees described above, but sequentially. In this way, in training classifier $\Phi_i$ one may take into account how classifiers $\Phi_1, \ldots, \Phi_{i-1}$ perform on the training examples, and concentrate on getting right those examples on which $\Phi_1, \ldots, \Phi_{i-1}$ have performed worst.
Specifically, for learning classifier $\Phi_t$ each $\langle d_j, c_i \rangle$ pair is given an “importance weight” $h^t_{ij}$ (where $h^1_{ij}$ is set to be equal for
all $\langle d_j, c_i \rangle$ pairs$^{15}$), which represents how hard it was for classifiers $\Phi_1, \ldots, \Phi_{t-1}$ to take a correct decision on this pair. These weights are exploited in learning $\Phi_t$, which will be specially tuned to correctly solve the pairs with higher weight. Classifier $\Phi_t$ is then applied to the training documents, and as a result weights $h^t_{ij}$ are updated to $h^{t+1}_{ij}$; in this update operation, pairs correctly classified by $\Phi_t$ will have their weight decreased, while pairs misclassified by $\Phi_t$ will have their weight increased. After all the $k$ classifiers have been built, a weighted linear combination rule is applied to yield the final committee.
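As an illustration of the reweighting idea (and only of the idea: the exact updates of Schapire et al. [1998] and Schapire and Singer [2000] differ in detail), here is one ADABOOST-style round; the interface is ours.

    import numpy as np

    def boosting_round(weights, predicted, truth):
        # `weights` are the importance weights h^t over the <d_j, c_i>
        # pairs; `predicted` and `truth` are boolean arrays giving the
        # current weak hypothesis and the target function on those pairs.
        correct = predicted == truth
        error = np.clip(weights[~correct].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - error) / error)  # weak-hypothesis weight
        # Misclassified pairs have their weight increased, correct ones
        # decreased; weights are then renormalized.
        weights = weights * np.exp(np.where(correct, -alpha, alpha))
        return weights / weights.sum(), alpha      # h^{t+1}, alpha_t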
In the BOOSTEXTER system [Schapire and Singer 2000], two different boosting algorithms are tested, using a one-level decision tree weak learner. The former algorithm (ADABOOST.MH, simply called ADABOOST in Schapire et al. [1998]) is explicitly geared toward the maximization of microaveraged effectiveness, whereas the latter (ADABOOST.MR) is aimed at minimizing ranking loss (i.e., at getting a correct category ranking for each individual document). In experiments conducted over
three different test collections, Schapire
et al. [1998] have shown ADABOOST to
outperform SLEEPING EXPERTS, a classifier
that had proven quite effective in the ex-
periments of Cohen and Singer [1999].
Further experiments by Schapire and
Singer [2000] showed ADABOOST to out-
perform, aside from SLEEPING EXPERTS, a
Naïve Bayes classifier, a standard (nonen-
hanced) Rocchio classifier, and Joachims’
[1997] PRTFIDF classifier.
A boosting algorithm based on a “com-
mittee of classifier subcommittees” that
improves on the effectiveness and (espe-
cially) the efficiency of ADABOOST.MH was
presented in Sebastiani et al. [2000]. An
approach similar to boosting was also em-
ployed by Weiss et al. [1999], who experi-
mented with committees of decision trees
each having an average of 16 leaves (and
hence much more complex than the simple “decision stumps” used in Schapire and Singer [2000]), eventually combined by using the simple MV rule as a combination rule; similarly to boosting, a mechanism for emphasizing documents that have been misclassified by previous decision trees is used. Boosting-based approaches have also been employed in Escudero et al. [2000], Iyer et al. [2000], Kim et al. [2000], Li and Jain [1998], and Myers et al. [2000].

15 Schapire et al. [1998] also showed that a simple modification of this policy allows optimization of the classifier based on “utility” (see Section 7.1.3).
6.12. Other Methods
Although in the previous sections we
have tried to give an overview as com-
plete as possible of the learning ap-
proaches proposed in the TC literature, it
is hardly possible to be exhaustive. Some
of the learning approaches adopted do
not fall squarely under one or the other
class of algorithms, or have remained
somehow isolated attempts. Among these,
the most noteworthy are the ones based
on Bayesian inference networks [Dumais
et al. 1998; Lam et al. 1997; Tzeras
and Hartmann 1993], genetic algorithms
[Clack et al. 1997; Masand 1994], and
maximum entropy modelling [Manning
and Schütze 1999].
7. EVALUATION OF TEXT CLASSIFIERS
As for text search systems, the eval-
uation of document classifiers is typ-
ically conducted experimentally, rather
than analytically. The reason is that, in
order to evaluate a system analytically
(e.g., proving that the system is correct
and complete), we would need a formal
specification of the problem that the sys-
tem is trying to solve (e.g., with respect
to what correctness and completeness are
defined), and the central notion of TC
(namely, that of membership of a docu-
ment in a category) is, due to its subjective
character, inherently nonformalizable.
The experimental evaluation of a clas-
sifier usually measures its effectiveness
(rather than its efficiency), that is, its
ability to take the right classification
decisions.
Table II. The Contingency Table for Category $c_i$

                                Expert judgments
    Category $c_i$              YES        NO
    Classifier      YES         $TP_i$     $FP_i$
    judgments       NO          $FN_i$     $TN_i$
7.1. Measures of Text
Categorization Effectiveness
7.1.1. Precision and Recall. Classification
effectiveness is usually measured in terms
of the classic IR notions of precision (π)
and recall (ρ), adapted to the case of
TC. Precision wrt $c_i$ ($\pi_i$) is defined as the conditional probability $P(\breve{\Phi}(d_x, c_i) = T \mid \Phi(d_x, c_i) = T)$, that is, as the probability that if a random document $d_x$ is classified under $c_i$, this decision is correct. Analogously, recall wrt $c_i$ ($\rho_i$) is defined as $P(\Phi(d_x, c_i) = T \mid \breve{\Phi}(d_x, c_i) = T)$, that is, as the probability that, if a random document $d_x$ ought to be classified under $c_i$, this decision is taken. These category-relative values may be averaged, in a way to be discussed shortly, to obtain $\pi$ and $\rho$, that is, values global to the entire category set. Borrowing terminology from logic, $\pi$ may be viewed as the “degree of soundness” of the classifier wrt $C$, while $\rho$ may be viewed as its “degree of completeness” wrt $C$. As defined here, $\pi_i$ and $\rho_i$ are to be understood as subjective probabilities, that is, as measuring the expectation of the user that the system will behave correctly when classifying an unseen document under $c_i$. These probabilities may be estimated in terms of the contingency table for $c_i$ on a given test set (see Table II).
Here, $FP_i$ (false positives wrt $c_i$, a.k.a. errors of commission) is the number of test documents incorrectly classified under $c_i$; $TN_i$ (true negatives wrt $c_i$), $TP_i$ (true positives wrt $c_i$), and $FN_i$ (false negatives wrt $c_i$, a.k.a. errors of omission) are defined accordingly. Estimates (indicated by carets) of precision wrt $c_i$ and recall wrt $c_i$ may thus be obtained as
$$\hat{\pi}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \hat{\rho}_i = \frac{TP_i}{TP_i + FN_i}.$$
For obtaining estimates of π and ρ, two
different methods may be adopted:
Table III. The Global Contingency Table

                                                 Expert judgments
    Category set $C = \{c_1, \ldots, c_{|C|}\}$              YES                                NO
    Classifier      YES         $TP = \sum_{i=1}^{|C|} TP_i$      $FP = \sum_{i=1}^{|C|} FP_i$
    judgments       NO          $FN = \sum_{i=1}^{|C|} FN_i$      $TN = \sum_{i=1}^{|C|} TN_i$
—microaveraging: $\pi$ and $\rho$ are obtained by summing over all individual decisions:
$$\hat{\pi}^{\mu} = \frac{TP}{TP + FP} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad \hat{\rho}^{\mu} = \frac{TP}{TP + FN} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)},$$
where “$\mu$” indicates microaveraging. The “global” contingency table (Table III) is thus obtained by summing over category-specific contingency tables;
—macroaveraging: precision and recall are first evaluated “locally” for each category, and then “globally” by averaging over the results of the different categories:
$$\hat{\pi}^{M} = \frac{\sum_{i=1}^{|C|} \hat{\pi}_i}{|C|}, \qquad \hat{\rho}^{M} = \frac{\sum_{i=1}^{|C|} \hat{\rho}_i}{|C|},$$
where “$M$” indicates macroaveraging.
These two methods may give quite dif-
ferent results, especially if the different
categories have very different generality.
For instance, the ability of a classifier to
behave well also on categories with low
generality (i.e., categories with few pos-
itive training instances) will be empha-
sized by macroaveraging and much less
so by microaveraging. Whether one or the
other should be used obviously depends on
the application requirements. From now
on, we will assume that microaveraging is
used; everything we will say in the rest of
Section 7 may be adapted to the case of
macroaveraging in the obvious way.
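Since the two averaging modes are easily confused, a small sketch may help; per-category contingency counts are assumed given, and categories for which a denominator vanishes (cf. Table V) are not handled.

    def micro_macro(tables):
        # tables: one (TP_i, FP_i, FN_i) triple per category c_i.
        TP = sum(tp for tp, fp, fn in tables)
        FP = sum(fp for tp, fp, fn in tables)
        FN = sum(fn for tp, fp, fn in tables)
        micro = (TP / (TP + FP), TP / (TP + FN))
        per_cat = [(tp / (tp + fp), tp / (tp + fn)) for tp, fp, fn in tables]
        macro = (sum(p for p, r in per_cat) / len(per_cat),
                 sum(r for p, r in per_cat) / len(per_cat))
        return micro, macro

    # One frequent and one rare category: macroaveraging weights them
    # equally, so the rare category's poor precision drags macro-pi down.
    print(micro_macro([(90, 10, 10), (1, 1, 8)]))
    # micro ~ (0.892, 0.835); macro ~ (0.700, 0.506)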
7.1.2. Other Measures of Effectiveness.
Measures alternative to π and ρ and
commonly used in the ML litera-
ture, such as accuracy (estimated as $\hat{A} = \frac{TP + TN}{TP + TN + FP + FN}$) and error (estimated as $\hat{E} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \hat{A}$), are not
widely used in TC. The reason is that, as
Yang [1999] pointed out, the large value
that their denominator typically has in
TC makes them much more insensitive to
variations in the number of correct deci-
sions ($TP + TN$) than $\pi$ and $\rho$. Besides, if $A$ is the adopted evaluation measure, in the frequent case of a very low average generality the trivial rejector (i.e., the classifier $\Phi$ such that $\Phi(d_j, c_i) = F$ for all $d_j$ and $c_i$) tends to outperform all nontrivial classifiers (see also Cohen [1995a], Section 2.3). If $A$ is adopted, parameter tuning on a validation set may thus result in parameter choices that make the classifier behave very much like the trivial rejector.
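A hypothetical numeric example of this phenomenon: on a test set of 1,000 documents with a category of generality 10/1,000, a classifier that finds half of the positives is beaten on accuracy by the trivial rejector.

    # Nontrivial classifier: finds 5 of the 10 positives, with 10 false alarms.
    TP, FP, FN, TN = 5, 10, 5, 980
    accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.985
    # Trivial rejector: TP = FP = 0, so TN = 990, FN = 10.
    trivial_accuracy = (0 + 990) / 1000          # 0.990, yet recall is 0
    print(accuracy, trivial_accuracy)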
A nonstandard effectiveness mea-
sure was proposed by Sable and
Hatzivassiloglou [2000, Section 7], who
suggested basing $\pi$ and $\rho$ not on “absolute” values of success and failure (i.e., 1 if $\Phi(d_j, c_i) = \breve{\Phi}(d_j, c_i)$ and 0 if $\Phi(d_j, c_i) \neq \breve{\Phi}(d_j, c_i)$), but on values of relative success (i.e., $\mathrm{CSV}_i(d_j)$ if $\breve{\Phi}(d_j, c_i) = T$ and $1 - \mathrm{CSV}_i(d_j)$ if $\breve{\Phi}(d_j, c_i) = F$). This means
that for a correct (respectively wrong)
decision the classifier is rewarded (re-
spectively penalized) proportionally to its
confidence in the decision. This proposed
measure does not reward the choice of a
good thresholding policy, and is thus unfit
for autonomous (“hard”) classification
systems. However, it might be appropri-
ate for interactive (“ranking”) classifiers
of the type used in Larkey [1999], where
the confidence that the classifier has
in its own decision influences category
ranking and, as a consequence, the overall
usefulness of the system.
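In code, the credit this proposal assigns to a single decision would look as follows (a sketch; the function name is ours, and $\mathrm{CSV}_i$ values are assumed to lie in $[0, 1]$):

    def relative_success(csv_value, belongs):
        # CSV_i(d_j) if d_j truly belongs to c_i, 1 - CSV_i(d_j) otherwise.
        return csv_value if belongs else 1.0 - csv_value

    print(relative_success(0.8, True))   # confident correct decision: 0.8
    print(relative_success(0.8, False))  # confident wrong decision: 0.2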
7.1.3. Measures Alternative to Effectiveness.
In general, criteria different from effec-
tiveness are seldom used in classifier eval-
uation. For instance, efficiency, although
very important for applicative purposes,
is seldom used as the sole yardstick, due
to the volatility of the parameters on
which the evaluation rests. However, ef-
ficiency may be useful for choosing among
Table IV. The Utility Matrix

                                                 Expert judgments
    Category set $C = \{c_1, \ldots, c_{|C|}\}$      YES          NO
    Classifier      YES         $u_{TP}$     $u_{FP}$
    judgments       NO          $u_{FN}$     $u_{TN}$
classifiers with similar effectiveness. An
interesting evaluation has been carried
out by Dumais et al. [1998], who have
compared five different learning methods
along three different dimensions, namely,
effectiveness, training efficiency (i.e., the average time it takes to build a classifier for category $c_i$ from a training set $Tr$), and classification efficiency (i.e., the average time it takes to classify a new document $d_j$ under category $c_i$).
An important alternative to effective-
ness is utility, a class of measures from
decision theory that extend effectiveness
by economic criteria such as gain or loss.
Utility is based on a utility matrix such as that of Table IV, where the numeric values $u_{TP}$, $u_{FP}$, $u_{FN}$, and $u_{TN}$ represent the gain brought about by a true positive, false positive, false negative, and true negative, respectively; both $u_{TP}$ and $u_{TN}$ are greater than both $u_{FP}$ and $u_{FN}$. “Standard” effectiveness is a special case of utility, i.e., the one in which $u_{TP} = u_{TN} > u_{FP} = u_{FN}$. Less trivial cases are those in which $u_{TP} \neq u_{TN}$ and/or $u_{FP} \neq u_{FN}$; this is appropriate, for example, in spam filtering, where failing to discard a piece of junk mail (FP) is a less serious mistake than discarding a legitimate message (FN) [Androutsopoulos et al. 2000].
If the classifier outputs probability estimates of the membership of $d_j$ in $c_i$, then decision theory provides analytical methods to determine thresholds $\tau_i$, thus avoiding the need to determine them experimentally (as discussed in Section 6.1). Specifically, as Lewis [1995a] reminds us, the expected value of utility is maximized when
$$\tau_i = \frac{u_{FP} - u_{TN}}{(u_{FN} - u_{TP}) + (u_{FP} - u_{TN})},$$
which, in the case of “standard” effectiveness, is equal to $\frac{1}{2}$.
Table V. Trivial Cases in TC

                                            Precision       Recall          C-precision     C-recall
                                            TP/(TP+FP)      TP/(TP+FN)      TN/(FP+TN)      TN/(TN+FN)
    Trivial rejector         (TP=FP=0)      Undefined       0/FN = 0        TN/TN = 1       TN/(TN+FN)
    Trivial acceptor         (FN=TN=0)      TP/(TP+FP)      TP/TP = 1       0/FP = 0        Undefined
    Trivial “Yes” collection (FP=TN=0)      TP/TP = 1       TP/(TP+FN)      Undefined       0/FN = 0
    Trivial “No” collection  (TP=FN=0)      0/FP = 0        Undefined       TN/(FP+TN)      TN/TN = 1
The use of utility in TC is discussed
in detail by Lewis [1995a]. Other works
where utility is employed are Amati and
Crestani [1999], Cohen and Singer [1999],
Hull et al. [1996], Lewis and Catlett
[1994], and Schapire et al. [1998]. Utility
has become popular within the text filter-
ing community, and the TREC “filtering
track” evaluations have been using it for
a while [Lewis 1995c]. The values of the
utility matrix are extremely application-
dependent. This means that if utility is
used instead of “pure” effectiveness, there
is a further element of difficulty in the
cross-comparison of classification systems
(see Section 7.3), since for two classifiers
to be experimentally comparable also the
two utility matrices must be the same.
Other effectiveness measures different
from the ones discussed here have occa-
sionally been used in the literature; these
include adjacent score [Larkey 1998],
coverage [Schapire and Singer 2000], one-
error [Schapire and Singer 2000], Pear-
son product-moment correlation [Larkey
1998], recall at n [Larkey and Croft 1996],
top candidate [Larkey and Croft 1996],
and top n [Larkey and Croft 1996]. We
will not attempt to discuss them in detail.
However, their use shows that, although
the TC community is making consistent
efforts at standardizing experimentation
protocols, we are still far from universal
agreement on evaluation issues and, as
a consequence, from understanding pre-
cisely the relative merits of the various
methods.
7.1.4. Combined Effectiveness Measures.
Neither precision nor recall makes sense in isolation from the other. In fact the classifier $\Phi$ such that $\Phi(d_j, c_i) = T$ for all $d_j$ and $c_i$ (the trivial acceptor) has $\rho = 1$. When the $\mathrm{CSV}_i$ function has values in $[0, 1]$, one only needs to set every threshold $\tau_i$ to 0 to obtain the trivial acceptor. In this case, $\pi$ would usually be very low (more precisely, equal to the average test set generality $\frac{\sum_{i=1}^{|C|} g_{Te}(c_i)}{|C|}$).$^{16}$ Conversely, it is well known from everyday IR practice that higher levels of $\pi$ may be obtained at the price of low values of $\rho$.
In practice, by tuning $\tau_i$ a function $\mathrm{CSV}_i : D \to \{T, F\}$ is tuned to be, in the words of Riloff and Lehnert [1994], more liberal (i.e., improving $\rho_i$ to the detriment of $\pi_i$) or more conservative (improving $\pi_i$ to the detriment of $\rho_i$).$^{17}$ A classifier should thus be evaluated by means of a measure which combines $\pi$ and $\rho$.$^{18}$ Various such measures have been proposed, among which the most frequent are:
(1) Eleven-point average precision: threshold $\tau_i$ is repeatedly tuned so as to allow $\rho_i$ to take up values of 0.0, .1, . . . , .9, 1.0; $\pi_i$ is computed for these 11 different values of $\tau_i$, and averaged over the 11 resulting values. This is analogous to the standard evaluation methodology for ranked IR systems, and may be used

(a) with categories in place of IR queries. This is most frequently used for document-ranking classifiers (see Schütze et al. [1995]; Yang [1994]; Yang [1999]; Yang and Pedersen [1997]);

(b) with test documents in place of IR queries and categories in place of documents. This is most frequently used for category-ranking classifiers (see Lam et al. [1999]; Larkey and Croft [1996]; Schapire and Singer [2000]; Wiener et al. [1995]). In this case, if macroaveraging is used, it needs to be redefined on a per-document, rather than per-category, basis.

This measure does not make sense for binary-valued $\mathrm{CSV}_i$ functions, since in this case $\rho_i$ may not be varied at will.
(2) The breakeven point, that is, the value at which $\pi$ equals $\rho$ (e.g., Apté et al. [1994]; Cohen and Singer [1999]; Dagan et al. [1997]; Joachims [1998]; Joachims [1999]; Lewis [1992a]; Lewis and Ringuette [1994]; Moulinier and Ganascia [1996]; Ng et al. [1997]; Yang [1999]). This is obtained by a process analogous to the one used for 11-point average precision: a plot of $\pi$ as a function of $\rho$ is computed by repeatedly varying the thresholds $\tau_i$; breakeven is the value of $\rho$ (or $\pi$) for which the plot intersects the $\rho = \pi$ line. This idea relies on the fact that, by decreasing the $\tau_i$'s from 1 to 0, $\rho$ always increases monotonically from 0 to 1 and $\pi$ usually decreases monotonically from a value near 1 to $\frac{1}{|C|}\sum_{i=1}^{|C|} g_{Te}(c_i)$. If for no values of the $\tau_i$'s $\pi$ and $\rho$ are exactly equal, the $\tau_i$'s are set to the value for which $\pi$ and $\rho$ are closest, and an interpolated breakeven is computed as the average of the values of $\pi$ and $\rho$.$^{19}$
(3) The $F_\beta$ function [van Rijsbergen 1979, Chapter 7], for some $0 \le \beta \le +\infty$ (e.g., Cohen [1995a]; Cohen and Singer [1999]; Lewis and Gale [1994]; Lewis [1995a]; Moulinier et al. [1996]; Ruiz and Srinivasan [1999]), where
$$F_\beta = \frac{(\beta^2 + 1)\pi\rho}{\beta^2 \pi + \rho}.$$
Here $\beta$ may be seen as the relative degree of importance attributed to $\pi$ and $\rho$. If $\beta = 0$ then $F_\beta$ coincides with $\pi$, whereas if $\beta = +\infty$ then $F_\beta$ coincides with $\rho$. Usually, a value $\beta = 1$ is used, which attributes equal importance to $\pi$ and $\rho$. As shown in Moulinier et al. [1996] and Yang [1999], the breakeven of a classifier is always less than or equal to its $F_1$ value.
16 From this, one might be tempted to infer, by symmetry, that the trivial rejector always has $\pi = 1$. This is false, as $\pi$ is undefined (the denominator is zero) for the trivial rejector (see Table V). In fact, it is clear from its definition ($\pi = \frac{TP}{TP+FP}$) that $\pi$ depends only on how the positives ($TP + FP$) are split between true positives $TP$ and false positives $FP$, and does not depend at all on the cardinality of the positives. There is a breakup of “symmetry” between $\pi$ and $\rho$ here because, from the point of view of classifier judgment (positives vs. negatives; this is the dichotomy of interest in trivial acceptor vs. trivial rejector), the “symmetric” of $\rho$ ($\frac{TP}{TP+FN}$) is not $\pi$ ($\frac{TP}{TP+FP}$) but C-precision ($\pi_c = \frac{TN}{FP+TN}$), the “contrapositive” of $\pi$. In fact, while $\rho = 1$ and $\pi_c = 0$ for the trivial acceptor, $\pi_c = 1$ and $\rho = 0$ for the trivial rejector.

17 While $\rho_i$ can always be increased at will by lowering $\tau_i$, usually at the cost of decreasing $\pi_i$, $\pi_i$ can usually be increased at will by raising $\tau_i$, always at the cost of decreasing $\rho_i$. This kind of tuning is only possible for $\mathrm{CSV}_i$ functions with values in $[0, 1]$; for binary-valued $\mathrm{CSV}_i$ functions tuning is not always possible, or is anyway more difficult (see Weiss et al. [1999], page 66).

18 An exception is single-label TC, in which $\pi$ and $\rho$ are not independent of each other: if a document $d_j$ has been classified under a wrong category $c_s$ (thus decreasing $\pi_s$), this also means that it has not been classified under the right category $c_t$ (thus decreasing $\rho_t$). In this case either $\pi$ or $\rho$ can be used as a measure of effectiveness.

19 Breakeven, first proposed by Lewis [1992a, 1992b], has recently been criticized. Lewis himself (see his message of 11 Sep 1997 10:49:01 to the DDLBETA text categorization mailing list; quoted with permission of the author) has pointed out that breakeven is not a good effectiveness measure, since (i) there may be no parameter setting that yields the breakeven; in this case the final breakeven value, obtained by interpolation, is artificial; and (ii) to have $\rho$ equal $\pi$ is not necessarily desirable, and it is not clear that a system that achieves high breakeven can be tuned to score high on other effectiveness measures. Yang [1999] also noted that when for no value of the parameters $\pi$ and $\rho$ are close enough, interpolated breakeven may not be a reliable indicator of effectiveness.
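A sketch of the two combined measures just defined, with an interpolated breakeven computed from a hypothetical $(\pi, \rho)$ curve (all names ours):

    def f_beta(p, r, beta=1.0):
        # van Rijsbergen's F_beta; beta = 0 reduces to pi, and
        # beta -> infinity reduces to rho.
        return (beta**2 + 1) * p * r / (beta**2 * p + r)

    def interpolated_breakeven(curve):
        # curve: (pi, rho) pairs obtained by sweeping the thresholds
        # tau_i; returns the average of pi and rho where they are closest.
        p, r = min(curve, key=lambda pr: abs(pr[0] - pr[1]))
        return (p + r) / 2

    curve = [(0.9, 0.3), (0.8, 0.6), (0.7, 0.75), (0.5, 0.9)]
    print(interpolated_breakeven(curve))  # 0.725
    print(f_beta(0.7, 0.75))              # F_1 at the closest point: ~0.724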
Once an effectiveness measure is chosen, a classifier can be tuned (e.g., thresholds and other parameters can be set) so that the resulting effectiveness is the best achievable by that classifier. Tuning a parameter $p$ (be it a threshold or other) is normally done experimentally. This means performing repeated experiments on the validation set with the values of the other parameters $p_k$ fixed (at a default value, in the case of a yet-to-be-tuned parameter $p_k$, or at the chosen value, if the parameter $p_k$ has already been tuned) and with different values for parameter $p$. The value that has yielded the best effectiveness is chosen for $p$.
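A sketch of this one-parameter-at-a-time loop; all names are ours, and `evaluate` is assumed to train a classifier and return its effectiveness on the validation set for a complete parameter assignment.

    def tune(grid, evaluate, defaults):
        # grid: {parameter name: candidate values}; parameters are tuned
        # in turn, with the others fixed at their defaults or at the
        # values already chosen for them.
        chosen = dict(defaults)
        for param, values in grid.items():
            chosen[param] = max(values,
                                key=lambda v: evaluate({**chosen, param: v}))
        return chosen

    # Toy usage with a made-up evaluation function:
    best = tune({"threshold": [0.3, 0.5, 0.7], "k": [10, 30, 50]},
                evaluate=lambda ps: -abs(ps["threshold"] - 0.5)
                                    - abs(ps["k"] - 30) / 100,
                defaults={"threshold": 0.5, "k": 30})
    print(best)  # {'threshold': 0.5, 'k': 30}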
7.2. Benchmarks for Text Categorization
Standard benchmark collections that can
be used as initial corpora for TC are publicly available for experimental purposes.
The most widely used is the Reuters col-
lection, consisting of a set of newswire
stories classified under categories related
to economics. The Reuters collection ac-
counts for most of the experimental work
in TC so far. Unfortunately, this does not
always translate into reliable comparative
results, in the sense that many of these ex-
periments have been carried out in subtly
different conditions.
In general, different sets of experiments
may be used for cross-classifier compar-
ison only if the experiments have been
performed
(1) on exactly the same collection (i.e.,
same documents and same categories);
(2) with the same “split” between training
set and test set;
(3) with the same evaluation measure
and, whenever this measure depends
on some parameters (e.g., the utility
matrix chosen), with the same param-
eter values.
Unfortunately, a lot of experimentation,
both on Reuters and on other collec-
tions, has not been performed with these
caveats in mind: by testing three differ-
ent classifiers on five popular versions
of Reuters, Yang [1999] has shown that
a lack of compliance with these three
conditions may make the experimental
results hardly comparable among each
other. Table VI lists the results of all
experiments known to us performed on
five major versions of the Reuters bench-
mark: Reuters-22173 “ModLewis” (column
#1), Reuters-22173 “ModApt ´ e” (column #2),
Reuters-22173 “ModWiener” (column #3),
Reuters-21578 “ModApt ´ e” (column #4),
and Reuters-21578[10] “ModApt ´ e” (column
#5).
20
Only experiments that have com-
puted either a breakeven or F
1
have been
listed, since other less popular effective-
ness measures do not readily compare
with these.
Note that only results belonging to the
same column are directly comparable.
In particular, Yang [1999] showed that
experiments carried out on Reuters-22173
“ModLewis” (column #1) are not directly
comparable with those using the other
three versions, since the former strangely
includes a significant percentage (58%) of
“unlabeled” test documents which, being
negative examples of all categories, tend
to depress effectiveness. Also, experi-
ments performed on Reuters-21578[10] “ModApté” (column #5) are not comparable with the others, since this collection is the restriction of Reuters-21578 “ModApté” to the 10 categories with the highest generality, and is thus an obviously “easier” collection.
Other test collections that have been
frequently used are
—the OHSUMED collection, set up
by Hersh et al. [1994] and used by
Joachims [1998], Lam and Ho [1998],
Lam et al. [1999], Lewis et al. [1996],
Ruiz and Srinivasan [1999], and Yang
20 The Reuters-21578 collection may be freely down-
loaded for experimentation purposes from http://
www.research.att.com/~lewis/reuters21578.html.
A new corpus, called Reuters Corpus Volume 1 and
consisting of roughly 800,000 documents, has
recently been made available by Reuters for
TC experiments (see http://about.reuters.com/
researchandstandards/corpus/). This will likely
replace Reuters-21578 as the “standard” Reuters
benchmark for TC.
Table VI. Comparative Results Among Different Classifiers Obtained on Five Different Versions of Reuters. (Unless otherwise noted, entries indicate the microaveraged breakeven point; “(M)” indicates macroaveraging and “(F1)” indicates use of the $F_1$ measure. The five columns are the five versions listed in the text.)

                                                                      #1       #2       #3       #4       #5
    # of documents                                                  21,450   14,347   13,272   12,902   12,902
    # of training documents                                         14,704   10,667    9,610    9,603    9,603
    # of test documents                                              6,746    3,680    3,662    3,299    3,299
    # of categories                                                    135       93       92       90       10

    System            Type             Results reported by            #1       #2       #3       #4       #5
    WORD              (non-learning)   Yang [1999]                   .150              .310     .290
                      probabilistic    [Dumais et al. 1998]                                     .752     .815
                      probabilistic    [Joachims 1998]                                          .720
                      probabilistic    [Lam et al. 1997]                      .443(MF1)
    PROPBAYES         probabilistic    [Lewis 1992a]                 .650
    BIM               probabilistic    [Li and Yamanishi 1999]                                  .747
                      probabilistic    [Li and Yamanishi 1999]                                  .773
    NB                probabilistic    [Yang and Liu 1999]                                      .795
                      decision trees   [Dumais et al. 1998]                                              .884
    C4.5              decision trees   [Joachims 1998]                                          .794
    IND               decision trees   [Lewis and Ringuette 1994]    .670
    SWAP-1            decision rules   [Apté et al. 1994]                     .805
    RIPPER            decision rules   [Cohen and Singer 1999]       .683     .811     .820
    SLEEPINGEXPERTS   decision rules   [Cohen and Singer 1999]       .753     .759     .827
    DL-ESC            decision rules   [Li and Yamanishi 1999]                                  .820
    CHARADE           decision rules   [Moulinier and Ganascia 1996]          .738
    CHARADE           decision rules   [Moulinier et al. 1996]                .783(F1)
    LLSF              regression       [Yang 1999]                                     .855     .810
    LLSF              regression       [Yang and Liu 1999]                                      .849
    BALANCEDWINNOW    on-line linear   [Dagan et al. 1997]                    .747(M)  .833(M)
    WIDROW-HOFF       on-line linear   [Lam and Ho 1998]                                        .822
    ROCCHIO           batch linear     [Cohen and Singer 1999]       .660     .748     .776
    FINDSIM           batch linear     [Dumais et al. 1998]                                     .617     .646
    ROCCHIO           batch linear     [Joachims 1998]                                          .799
    ROCCHIO           batch linear     [Lam and Ho 1998]                                        .781
    ROCCHIO           batch linear     [Li and Yamanishi 1999]                                  .625
    CLASSI            neural network   [Ng et al. 1997]                                         .802
    NNET              neural network   [Yang and Liu 1999]                                      .838
                      neural network   [Wiener et al. 1995]                            .820
    GIS-W             example-based    [Lam and Ho 1998]                                        .860
    k-NN              example-based    [Joachims 1998]                                          .823
    k-NN              example-based    [Lam and Ho 1998]                                        .820
    k-NN              example-based    [Yang 1999]                   .690              .852     .820
    k-NN              example-based    [Yang and Liu 1999]                                      .856
                      SVM              [Dumais et al. 1998]                                     .870     .920
    SVMLIGHT          SVM              [Joachims 1998]                                          .864
    SVMLIGHT          SVM              [Li and Yamanishi 1999]                                  .841
    SVMLIGHT          SVM              [Yang and Liu 1999]                                      .859
    ADABOOST.MH       committee        [Schapire and Singer 2000]             .860
                      committee        [Weiss et al. 1999]                                      .878
                      Bayesian net     [Dumais et al. 1998]                                     .800     .850
                      Bayesian net     [Lam et al. 1997]                      .542(MF1)
and Pedersen [1997].$^{21}$ The documents are titles or title-plus-abstracts from medical journals (OHSUMED is actually a subset of the Medline document base); the categories are the “postable terms” of the MESH thesaurus.

21 The OHSUMED collection may be freely downloaded for experimentation purposes from ftp://medir.ohsu.edu/pub/ohsumed.
—the 20 Newsgroups collection, set up
by Lang [1995] and used by Baker
and McCallum [1998], Joachims
[1997], McCallum and Nigam [1998],
McCallum et al. [1998], Nigam et al.
[2000], and Schapire and Singer [2000].
The documents are messages posted to
Usenet newsgroups, and the categories
are the newsgroups themselves.
—the AP collection, used by Cohen [1995a,
1995b], Cohen and Singer [1999], Lewis
and Catlett [1994], Lewis and Gale
[1994], Lewis et al. [1996], Schapire
and Singer [2000], and Schapire et al.
[1998].
We will not cover the experiments per-
formed on these collections for the same
reasons as those illustrated in footnote 20,
that is, because in no case have a signifi-
cant enough number of authors used the
same collection in the same experimen-
tal conditions, thus making comparisons
difficult.
7.3. Which Text Classifier Is Best?
The published experimental results, and
especially those listed in Table VI, allow
us to attempt some considerations on the
comparative performance of the TC meth-
ods discussed. However, we have to bear in
mind that comparisons are reliable only
when based on experiments performed
by the same author under carefully con-
trolled conditions. They are instead more
problematic when they involve different
experiments performed by different au-
thors. In this case various “background
conditions,” often extraneous to the learn-
ing algorithm itself, may influence the re-
sults. These may include, among others,
different choices in preprocessing (stem-
ming, etc.), indexing, dimensionality re-
duction, classifier parameter values, etc.,
but also different standards of compliance
with safe scientific practice (such as tun-
ing parameters on the test set rather than
on a separate validation set), which often
are not discussed in the published papers.
Two different methods may thus be
applied for comparing classifiers [Yang
1999]:
—direct comparison: classifiers $\Phi'$ and $\Phi''$ may be compared when they have been tested on the same collection $\Omega$, usually by the same researchers and with the same background conditions. This is the more reliable method.
—indirect comparison: classifiers $\Phi'$ and $\Phi''$ may be compared when

(1) they have been tested on collections $\Omega'$ and $\Omega''$, respectively, typically by different researchers and hence with possibly different background conditions;

(2) one or more “baseline” classifiers $\bar{\Phi}_1, \ldots, \bar{\Phi}_m$ have been tested on both $\Omega'$ and $\Omega''$ by the direct comparison method.

Test 2 gives an indication of the relative “hardness” of $\Omega'$ and $\Omega''$; using this and the results from Test 1, we may obtain an indication of the relative effectiveness of $\Phi'$ and $\Phi''$. For the reasons discussed above, this method is less reliable.
A number of interesting conclusions can be drawn from Table VI by using these two methods. Concerning the relative “hardness” of the five collections, if by $\Omega' > \Omega''$ we indicate that $\Omega'$ is a harder collection than $\Omega''$, there seems to be enough evidence that Reuters-22173 “ModLewis” ≫ Reuters-22173 “ModWiener” > Reuters-22173 “ModApté” ≈ Reuters-21578 “ModApté” > Reuters-21578[10] “ModApté.” These facts are unsurprising; in particular, the first and the last inequalities are a direct consequence of the peculiar characteristics of Reuters-22173 “ModLewis” and Reuters-21578[10] “ModApté” discussed in Section 7.2.
Concerning the relative performance of
the classifiers, remembering the consid-
erations above we may attempt a few
conclusions:
—Boosting-based classifier committees,
support vector machines, example-
based methods, and regression methods
deliver top-notch performance. There
seems to be no sufficient evidence to
decidedly opt for either method; ef-
ficiency considerations or application-
dependent issues might play a role in
breaking the tie.
—Neural networks and on-line linear classifiers work very well, although slightly
worse than the previously mentioned
methods.
—Batch linear classifiers (Rocchio) and
probabilistic Na¨ıve Bayes classifiers
look the worst of the learning-based
classifiers. For Rocchio, these results
confirm earlier results by Schütze et al. [1995], who had found three classi-
fiers based on linear discriminant anal-
ysis, linear regression, and neural net-
works to perform about 15% better
than Rocchio. However, recent results
by Schapire et al. [1998] ranked Rocchio
along the best performers once near-
positives are used in training.
—The data in Table VI is hardly suf-
ficient to say anything about decision
trees. However, the work by Dumais
et al. [1998], in which a decision tree
classifier was shown to perform nearly
as well as their top performing system
(an SVM classifier), will probably renew the interest in decision trees, an interest
that had dwindled after the unimpres-
sive results reported in earlier litera-
ture [Cohen and Singer 1999; Joachims
1998; Lewis and Catlett 1994; Lewis
and Ringuette 1994].
—By far the lowest performance is
displayed by WORD, a classifier im-
plemented by Yang [1999] and not
including any learning component.$^{22}$
Concerning WORD and no-learning classi-
fiers, for completeness we should recall
that one of the highest effectiveness values
reported in the literature for the Reuters
collection (a .90 breakeven) belongs to
CONSTRUE, a manually constructed clas-
sifier. However, this classifier has never
been tested on the standard variants of
Reuters mentioned in Table VI, and it is
not clear [Yang 1999] whether the (small)
test set of Reuters-22173 “ModHayes” on
which the .90 breakeven value was ob-
tained was chosen randomly, as safe sci-
entific practice would demand. Therefore,
the fact that this figure is indicative of the
performance of CONSTRUE, and of the man-
ual approach it represents, has been con-
vincingly questioned [Yang 1999].

22 WORD is based on the comparison between documents and category names, each treated as a vector of weighted terms in the vector space model. WORD was implemented by Yang with the only purpose of determining the difference in effectiveness that adding a learning component to a classifier brings about. WORD is actually called STR in [Yang 1994; Yang and Chute 1994]. Another no-learning classifier was proposed in Wong et al. [1996].
It is important to bear in mind that
the considerations above are not abso-
lute statements (if there may be any)
on the comparative effectiveness of these
TC methods. One of the reasons is that
a particular applicative context may ex-
hibit very different characteristics from
the ones to be found in Reuters, and dif-
ferent classifiers may respond differently
to these characteristics. An experimen-
tal study by Joachims [1998] involving
support vector machines, k-NN, decision
trees, Rocchio, and Na¨ıve Bayes, showed
all these classifiers to have similar effectiveness on categories with ≥300 positive training examples each. The fact
that this experiment involved the meth-
ods which have scored best (support vec-
tor machines, k-NN) and worst (Rocchio
and Na¨ıve Bayes) according to Table VI
shows that applicative contexts different
from Reuters may well invalidate conclu-
sions drawn on this latter.
Finally, a note about the worth of sta-
tistical significance testing. Few authors
have gone to the trouble of validating their
results by means of such tests. These tests
are useful for verifying how strongly the
experimental results support the claim that a given system $\Phi'$ is better than another system $\Phi''$, or for verifying how much a difference in the experimental setup affects the measured effectiveness of a system $\Phi$. Hull [1994] and Schütze et al.
[1995] have been among the first to work
in this direction, validating their results
by means of the ANOVA test and the Fried-
man test; the former is aimed at determin-
ing the significance of the difference in ef-
fectiveness between two methods in terms
of the ratio betweenthis difference and the
effectiveness variability across categories,
while the latter conducts a similar test by
using instead the rank positions of each
method within a category. Yang and Liu
[1999] defined a full suite of significance
tests, some of which apply to microaveraged and some to macroaveraged effectiveness. They applied them systematically
to the comparison between five different
classifiers, and were thus able to infer fine-
grained conclusions about their relative
effectiveness. For other examples of sig-
nificance testing in TC, see Cohen [1995a,
1995b]; Cohen and Hirsh [1998], Joachims
[1997], Koller and Sahami [1997], Lewis
et al. [1996], and Wiener et al. [1995].
8. CONCLUSION
Automated TC is now a major research
area within the information systems dis-
cipline, thanks to a number of factors:
—Its domains of application are numer-
ous and important, and given the pro-
liferation of documents in digital form
they are bound to increase dramatically
in both number and importance.
—It is indispensable in many applica-
tions in which the sheer number of
the documents to be classified and the
short response time required by the ap-
plication make the manual alternative
implausible.
—It can improve the productivity of
human classifiers in applications in
which no classification decision can be
taken without a final human judgment
[Larkey and Croft 1996], by providing
tools that quickly “suggest” plausible
decisions.
—It has reached effectiveness levels com-
parable to those of trained profession-
als. The effectiveness of manual TC is
not 100% anyway [Cleverdon 1984] and,
more importantly, it is unlikely to be
improved substantially by the progress
of research. The levels of effectiveness
of automated TC are instead growing
at a steady pace, and even if they will
likely reach a plateau well below the
100% level, this plateau will probably
be higher than the effectiveness levels
of manual TC.
One of the reasons why from the early
’90s the effectiveness of text classifiers
has dramatically improved is the arrival
in the TC arena of ML methods that
are backed by strong theoretical motiva-
tions. Examples of these are multiplica-
tive weight updating (e.g., the WINNOW
family, WIDROW-HOFF, etc.), adaptive re-
sampling (e.g., boosting), and support vec-
tor machines, which provide a sharp con-
trast with relatively unsophisticated and
weak methods such as Rocchio. In TC,
ML researchers have found a challeng-
ing application, since datasets consisting
of hundreds of thousands of documents
and characterized by tens of thousands of
terms are widely available. This means
that TC is a good benchmark for checking
whether a given learning technique can
scale up to substantial sizes. In turn, this
probably means that the active involve-
ment of the ML community in TC is bound
to grow.
The success story of automated TC is
also going to encourage an extension of
its methods and techniques to neighbor-
ing fields of application. Techniques typ-
ical of automated TC have already been
extended successfully to the categorization of documents expressed in slightly different media; for instance:
—very noisy text resulting from opti-
cal character recognition [Ittner et al.
1995; Junker and Hoch 1998]. In their
experiments Ittner et al. [1995] have
found that, by employing noisy texts
also in the training phase (i.e. texts af-
fected by the same source of noise that
is also at work in the test documents),
effectiveness levels comparable to those
obtainable in the case of standard text
can be achieved.
—speech transcripts [Myers et al.
2000; Schapire and Singer 2000].
For instance, Schapire and Singer
[2000] classified answers given to a
phone operator’s request “How may I
help you?” so as to be able to route the
call to a specialized operator according
to call type.
Concerning other more radically differ-
ent media, the situation is not as bright
(however, see Lim [1999] for an interest-
ing attempt at image categorization based
on a textual metaphor). The reason for
this is that capturing real semantic con-
tent of nontextual media by automatic in-
dexing is still an open problem. While
there are systems that attempt to detect
content, for example, in images by rec-
ognizing shapes, color distributions, and
texture, the general problem of image se-
mantics is still unsolved. The main reason
is that natural language, the language of
the text medium, admits far fewer vari-
ations than the “languages” employed by
the other media. For instance, while the
concept of a house can be “triggered” by
relatively few natural language expressions such as house, houses, home, housing,
inhabiting, etc., it can be triggered by far
more images: the images of all the differ-
ent houses that exist, of all possible colors
and shapes, viewed from all possible per-
spectives, from all possible distances, etc.
If we had solved the multimedia indexing
problem in a satisfactory way, the general
methodology that we have discussed in
this paper for text would also apply to au-
tomated multimedia categorization, and
there are reasons to believe that the ef-
fectiveness levels could be as high. This
only adds to the common sentiment that
more research in automated content-
based indexing for multimedia documents
is needed.
ACKNOWLEDGMENTS
This paper owes a lot to the suggestions and con-
structive criticism of Norbert Fuhr and David Lewis.
Thanks also to Umberto Straccia for comments on
an earlier draft, to Evgeniy Gabrilovich, Daniela
Giorgetti, and Alessandro Moschitti for spotting mis-
takes in an earlier draft, and to Alessandro Sperduti
for many fruitful discussions.
REFERENCES
AMATI, G. AND CRESTANI, F. 1999. Probabilistic
learning for selective dissemination of informa-
tion. Inform. Process. Man. 35, 5, 633–654.
ANDROUTSOPOULOS, I., KOUTSIAS, J., CHANDRINOS, K. V.,
AND SPYROPOULOS, C. D. 2000. An experimen-
tal comparison of naive Bayesian and keyword-
based anti-spam filtering with personal e-mail
messages. In Proceedings of SIGIR-00, 23rd
ACM International Conference on Research and
Development in Information Retrieval (Athens,
Greece, 2000), 160–167.
APTÉ, C., DAMERAU, F. J., AND WEISS, S. M. 1994.
Automated learning of decision rules for text
categorization. ACM Trans. on Inform. Syst. 12,
3, 233–251.
ATTARDI, G., DI MARCO, S., AND SALVI, D. 1998. Cat-
egorization by context. J. Univers. Comput. Sci.
4, 9, 719–736.
BAKER, L. D. AND MCCALLUM, A. K. 1998. Distribu-
tional clustering of words for text classification.
In Proceedings of SIGIR-98, 21st ACM Interna-
tional Conference on Research and Development
in Information Retrieval (Melbourne, Australia,
1998), 96–103.
BELKIN, N. J. AND CROFT, W. B. 1992. Information
filtering and information retrieval: two sides
of the same coin? Commun. ACM 35, 12, 29–
38.
BIEBRICHER, P., FUHR, N., KNORZ, G., LUSTIG, G., AND
SCHWANTNER, M. 1988. The automatic index-
ing system AIR/PHYS. From research to appli-
cation. In Proceedings of SIGIR-88, 11th ACM
International Conference on Research and De-
velopment in Information Retrieval (Grenoble,
France, 1988), 333–342. Also reprinted in Sparck
Jones and Willett [1997], pp. 513–517.
BORKO, H. AND BERNICK, M. 1963. Automatic docu-
ment classification. J. Assoc. Comput. Mach. 10,
2, 151–161.
CAROPRESO, M. F., MATWIN, S., AND SEBASTIANI, F.
2001. A learner-independent evaluation of the
usefulness of statistical phrases for automated
text categorization. In Text Databases and Doc-
ument Management: Theory and Practice, A. G.
Chin, ed. Idea Group Publishing, Hershey, PA,
78–102.
CAVNAR, W. B. AND TRENKLE, J. M. 1994. N-gram-
based text categorization. In Proceedings of
SDAIR-94, 3rd Annual Symposium on Docu-
ment Analysis and Information Retrieval (Las
Vegas, NV, 1994), 161–175.
CHAKRABARTI, S., DOM, B. E., AGRAWAL, R., AND
RAGHAVAN, P. 1998a. Scalable feature selec-
tion, classification and signature generation for
organizing large text databases into hierarchical
topic taxonomies. J. Very Large Data Bases 7, 3,
163–178.
CHAKRABARTI, S., DOM, B. E., AND INDYK, P. 1998b.
Enhanced hypertext categorization using hyper-
links. In Proceedings of SIGMOD-98, ACM In-
ternational Conference on Management of Data
(Seattle, WA, 1998), 307–318.
CLACK, C., FARRINGDON, J., LIDWELL, P., AND YU, T.
1997. Autonomous document classification for
business. In Proceedings of the 1st International
Conference on Autonomous Agents (Marina del
Rey, CA, 1997), 201–208.
CLEVERDON, C. 1984. Optimizing convenient on-
line access to bibliographic databases. Inform.
Serv. Use 4, 1, 37–47. Also reprinted in Willett
[1988], pp. 32–41.
COHEN, W. W. 1995a. Learning to classify English
text with ILP methods. In Advances in Inductive
Logic Programming, L. De Raedt, ed. IOS Press,
Amsterdam, The Netherlands, 124–143.
COHEN, W. W. 1995b. Text categorization and rela-
tional learning. In Proceedings of ICML-95, 12th
International Conference on Machine Learning
(Lake Tahoe, CA, 1995), 124–132.
COHEN, W. W. AND HIRSH, H. 1998. Joins that gen-
eralize: text classification using WHIRL. In Pro-
ceedings of KDD-98, 4th International Confer-
ence on Knowledge Discovery and Data Mining
(New York, NY, 1998), 169–173.
COHEN, W. W. AND SINGER, Y. 1999. Context-
sensitive learning methods for text categoriza-
tion. ACM Trans. Inform. Syst. 17, 2, 141–
173.
COOPER, W. S. 1995. Some inconsistencies and mis-
nomers in probabilistic information retrieval.
ACM Trans. Inform. Syst. 13, 1, 100–111.
CREECY, R. M., MASAND, B. M., SMITH, S. J., AND WALTZ,
D. L. 1992. Trading MIPS and memory for
knowledge engineering: classifying census re-
turns on the Connection Machine. Commun.
ACM 35, 8, 48–63.
CRESTANI, F., LALMAS, M., VAN RIJSBERGEN, C. J., AND
CAMPBELL, I. 1998. “Is this document rele-
vant? . . . probably.” A survey of probabilistic
models in information retrieval. ACM Comput.
Surv. 30, 4, 528–552.
DAGAN, I., KAROV, Y., AND ROTH, D. 1997. Mistake-
driven learning in text categorization. In Pro-
ceedings of EMNLP-97, 2nd Conference on Em-
pirical Methods in Natural Language Processing
(Providence, RI, 1997), 55–63.
DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W.,
LANDAUER, T. K., AND HARSHMAN, R. 1990. In-
dexing by latent semantic indexing. J. Amer. Soc.
Inform. Sci. 41, 6, 391–407.
DENOYER, L., ZARAGOZA, H., AND GALLINARI, P. 2001.
HMM-based passage models for document clas-
sification and ranking. In Proceedings of ECIR-
01, 23rd European Colloquium on Information
Retrieval Research (Darmstadt, Germany, 2001).
DÍAZ ESTEBAN, A., DE BUENAGA RODRÍGUEZ, M., UREÑA LÓPEZ, L. A., AND GARCÍA VEGA, M. 1998. In-
tegrating linguistic resources in an uniform
way for text classification tasks. In Proceed-
ings of LREC-98, 1st International Conference on
Language Resources and Evaluation (Grenada,
Spain, 1998), 1197–1204.
DOMINGOS, P. AND PAZZANI, M. J. 1997. On the
optimality of the simple Bayesian classifier un-
der zero-one loss. Mach. Learn. 29, 2–3, 103–130.
DRUCKER, H., VAPNIK, V., AND WU, D. 1999. Auto-
matic text categorization and its applications to
text retrieval. IEEE Trans. Neural Netw. 10, 5,
1048–1054.
DUMAIS, S. T. AND CHEN, H. 2000. Hierarchical clas-
sification of Web content. In Proceedings of
SIGIR-00, 23rd ACM International Conference
on Research and Development in Information
Retrieval (Athens, Greece, 2000), 256–263.
DUMAIS, S. T., PLATT, J., HECKERMAN, D., AND SAHAMI,
M. 1998. Inductive learning algorithms and
representations for text categorization. In Pro-
ceedings of CIKM-98, 7th ACM International
Conference on Information and Knowledge Man-
agement (Bethesda, MD, 1998), 148–155.
ESCUDERO, G., MÀRQUEZ, L., AND RIGAU, G. 2000.
Boosting applied to word sense disambiguation.
In Proceedings of ECML-00, 11th European Con-
ference on Machine Learning (Barcelona, Spain,
2000), 129–141.
FIELD, B. 1975. Towards automatic indexing: auto-
matic assignment of controlled-language index-
ing and classification from free indexing. J. Doc-
ument. 31, 4, 246–265.
FORSYTH, R. S. 1999. New directions in text catego-
rization. In Causal Models and Intelligent Data
Management, A. Gammerman, ed. Springer,
Heidelberg, Germany, 151–185.
FRASCONI, P., SODA, G., AND VULLO, A. 2002. Text
categorization for multi-page documents: A
hybrid naive Bayes HMM approach. J. Intell.
Inform. Syst. 18, 2/3 (March–May), 195–217.
FUHR, N. 1985. A probabilistic model of dictionary-
based automatic indexing. In Proceedings of
RIAO-85, 1st International Conference “Re-
cherche d’Information Assistee par Ordinateur”
(Grenoble, France, 1985), 207–216.
FUHR, N. 1989. Models for retrieval with proba-
bilistic indexing. Inform. Process. Man. 25, 1, 55–
72.
FUHR, N. AND BUCKLEY, C. 1991. A probabilistic
learning approach for document indexing. ACM
Trans. Inform. Syst. 9, 3, 223–248.
FUHR, N., HARTMANN, S., KNORZ, G., LUSTIG, G.,
SCHWANTNER, M., AND TZERAS, K. 1991.
AIR/X—a rule-based multistage indexing
system for large subject fields. In Proceed-
ings of RIAO-91, 3rd International Conference
“Recherche d’Information Assistee par Ordina-
teur” (Barcelona, Spain, 1991), 606–623.
FUHR, N. AND KNORZ, G. 1984. Retrieval test
evaluation of a rule-based automated index-
ing (AIR/PHYS). In Proceedings of SIGIR-84,
7th ACM International Conference on Research
and Development in Information Retrieval
(Cambridge, UK, 1984), 391–408.
FUHR, N. AND PFEIFER, U. 1994. Probabilistic in-
formation retrieval as combination of abstrac-
tion inductive learning and probabilistic as-
sumptions. ACM Trans. Inform. Syst. 12, 1,
92–115.
FÜRNKRANZ, J. 1999. Exploiting structural infor-
mation for text classification on the WWW.
In Proceedings of IDA-99, 3rd Symposium on
Intelligent Data Analysis (Amsterdam, The
Netherlands, 1999), 487–497.
GALAVOTTI, L., SEBASTIANI, F., AND SIMI, M. 2000.
Experiments on the use of feature selec-
tion and negative evidence in automated text
categorization. In Proceedings of ECDL-00,
4th European Conference on Research and
Advanced Technology for Digital Libraries
(Lisbon, Portugal, 2000), 59–68.
GALE, W. A., CHURCH, K. W., AND YAROWSKY, D. 1993.
A method for disambiguating word senses in a
large corpus. Comput. Human. 26, 5, 415–439.
GÖVERT, N., LALMAS, M., AND FUHR, N. 1999. A probabilistic description-oriented approach for
categorising Web documents. In Proceedings of
CIKM-99, 8th ACM International Conference
on Information and Knowledge Management
(Kansas City, MO, 1999), 475–482.
GRAY, W. A. AND HARLEY, A. J. 1971. Computer-
assisted indexing. Inform. Storage Retrieval 7,
4, 167–174.
GUTHRIE, L., WALKER, E., AND GUTHRIE, J. A. 1994.
Document classification by machine: theory
and practice. In Proceedings of COLING-94, 15th
International Conference on Computational Lin-
guistics (Kyoto, Japan, 1994), 1059–1063.
HAYES, P. J., ANDERSEN, P. M., NIRENBURG, I. B.,
AND SCHMANDT, L. M. 1990. Tcs: a shell for
content-based text categorization. In Proceed-
ings of CAIA-90, 6th IEEE Conference on Arti-
ficial Intelligence Applications (Santa Barbara,
CA, 1990), 320–326.
HEAPS, H. 1973. A theory of relevance for au-
tomatic document classification. Inform. Con-
trol 22, 3, 268–278.
HERSH, W., BUCKLEY, C., LEONE, T., AND HICKAM, D.
1994. OHSUMED: an interactive retrieval evalu-
ation and new large text collection for research.
In Proceedings of SIGIR-94, 17th ACM Interna-
tional Conference on Research and Development
in Information Retrieval (Dublin, Ireland, 1994),
192–201.
HULL, D. A. 1994. Improving text retrieval for the
routing problem using latent semantic indexing.
In Proceedings of SIGIR-94, 17th ACM Interna-
tional Conference on Research and Development
in Information Retrieval (Dublin, Ireland, 1994),
282–289.
HULL, D. A., PEDERSEN, J. O., AND SCHÜTZE, H. 1996.
Method combination for document filtering. In
Proceedings of SIGIR-96, 19th ACM Interna-
tional Conference on Research and Development
in Information Retrieval (Zürich, Switzerland,
1996), 279–288.
ITTNER, D. J., LEWIS, D. D., AND AHN, D. D. 1995.
Text categorization of low quality images. In
Proceedings of SDAIR-95, 4th Annual Sympo-
sium on Document Analysis and Information
Retrieval (Las Vegas, NV, 1995), 301–315.
IWAYAMA, M. AND TOKUNAGA, T. 1995. Cluster-based
text categorization: a comparison of category
search strategies. In Proceedings of SIGIR-95,
18th ACM International Conference on Research
and Development in Information Retrieval
(Seattle, WA, 1995), 273–281.
IYER, R. D., LEWIS, D. D., SCHAPIRE, R. E., SINGER, Y.,
AND SINGHAL, A. 2000. Boosting for document
routing. In Proceedings of CIKM-00, 9th ACM
International Conference on Information and
Knowledge Management (McLean, VA, 2000),
70–77.
JOACHIMS, T. 1997. A probabilistic analysis of the
Rocchio algorithm with TFIDF for text cat-
egorization. In Proceedings of ICML-97, 14th
International Conference on Machine Learning
(Nashville, TN, 1997), 143–151.
JOACHIMS, T. 1998. Text categorization with sup-
port vector machines: learning with many rel-
evant features. In Proceedings of ECML-98,
10th European Conference on Machine Learning
(Chemnitz, Germany, 1998), 137–142.
JOACHIMS, T. 1999. Transductive inference for text
classification using support vector machines. In
Proceedings of ICML-99, 16th International Con-
ference on Machine Learning (Bled, Slovenia,
1999), 200–209.
JOACHIMS, T. AND SEBASTIANI, F. 2002. Guest editors’
introduction to the special issue on automated
text categorization. J. Intell. Inform. Syst. 18, 2/3
(March–May), 103–105.
JOHN, G. H., KOHAVI, R., AND PFLEGER, K. 1994. Ir-
relevant features and the subset selection prob-
lem. In Proceedings of ICML-94, 11th Interna-
tional Conference on Machine Learning (New
Brunswick, NJ, 1994), 121–129.
JUNKER, M. AND ABECKER, A. 1997. Exploiting the-
saurus knowledge in rule induction for text clas-
sification. In Proceedings of RANLP-97, 2nd In-
ternational Conference on Recent Advances in
Natural Language Processing (Tzigov Chark,
Bulgaria, 1997), 202–207.
JUNKER, M. AND HOCH, R. 1998. An experimen-
tal evaluation of OCR text representations for
learning document classifiers. Internat. J. Docu-
ment Analysis and Recognition 1, 2, 116–122.
KESSLER, B., NUNBERG, G., AND SCHÜTZE, H. 1997.
Automatic detection of text genre. In Proceed-
ings of ACL-97, 35th Annual Meeting of the Asso-
ciation for Computational Linguistics (Madrid,
Spain, 1997), 32–38.
KIM, Y.-H., HAHN, S.-Y., AND ZHANG, B.-T. 2000. Text
filtering by boosting naive Bayes classifiers. In
Proceedings of SIGIR-00, 23rd ACM Interna-
tional Conference on Research and Development
in Information Retrieval (Athens, Greece, 2000),
168–175.
KLINKENBERG, R. AND JOACHIMS, T. 2000. Detect-
ing concept drift with support vector machines.
In Proceedings of ICML-00, 17th International
Conference on Machine Learning (Stanford, CA,
2000), 487–494.
KNIGHT, K. 1999. Mining online text. Commun.
ACM 42, 11, 58–61.
KNORZ, G. 1982. A decision theory approach to
optimal automated indexing. In Proceedings of
SIGIR-82, 5th ACM International Conference
on Research and Development in Information
Retrieval (Berlin, Germany, 1982), 174–193.
KOLLER, D. AND SAHAMI, M. 1997. Hierarchically
classifying documents using very few words. In
Proceedings of ICML-97, 14th International Con-
ference on Machine Learning (Nashville, TN,
1997), 170–178.
KORFHAGE, R. R. 1997. Information Storage and
Retrieval. Wiley Computer Publishing, New
York, NY.
LAM, S. L. AND LEE, D. L. 1999. Feature reduc-
tion for neural network based text categoriza-
tion. In Proceedings of DASFAA-99, 6th IEEE
International Conference on Database Advanced
Systems for Advanced Application (Hsinchu,
Taiwan, 1999), 195–202.
LAM, W. AND HO, C. Y. 1998. Using a generalized
instance set for automatic text categorization.
In Proceedings of SIGIR-98, 21st ACM Interna-
tional Conference on Research and Development
in Information Retrieval (Melbourne, Australia,
1998), 81–89.
LAM, W., LOW, K. F., AND HO, C. Y. 1997. Using a
Bayesian network induction approach for text
categorization. In Proceedings of IJCAI-97, 15th
International Joint Conference on Artificial In-
telligence (Nagoya, Japan, 1997), 745–750.
LAM, W., RUIZ, M. E., AND SRINIVASAN, P. 1999. Auto-
matic text categorization and its applications to
text retrieval. IEEE Trans. Knowl. Data Engin.
11, 6, 865–879.
LANG, K. 1995. NEWSWEEDER: learning to filter net-
news. In Proceedings of ICML-95, 12th Interna-
tional Conference on Machine Learning (Lake
Tahoe, CA, 1995), 331–339.
LARKEY, L. S. 1998. Automatic essay grading us-
ing text categorization techniques. In Pro-
ceedings of SIGIR-98, 21st ACM International
Conference on Research and Development in
Information Retrieval (Melbourne, Australia,
1998), 90–95.
LARKEY, L. S. 1999. A patent search and classifica-
tion system. In Proceedings of DL-99, 4th ACM
Conference on Digital Libraries (Berkeley, CA,
1999), 179–187.
LARKEY, L. S. AND CROFT, W. B. 1996. Combining
classifiers in text categorization. In Proceedings
of SIGIR-96, 19th ACM International Conference
on Research and Development in Information
Retrieval (Zürich, Switzerland, 1996), 289–297.
LEWIS, D. D. 1992a. An evaluation of phrasal and
clustered representations on a text categoriza-
tion task. In Proceedings of SIGIR-92, 15th ACM
International Conference on Research and Devel-
opment in Information Retrieval (Copenhagen,
Denmark, 1992), 37–50.
LEWIS, D. D. 1992b. Representation and Learn-
ing in Information Retrieval. Ph. D. thesis, De-
partment of Computer Science, University of
Massachusetts, Amherst, MA.
LEWIS, D. D. 1995a. Evaluating and optimizing au-
tonomous text classification systems. In Pro-
ceedings of SIGIR-95, 18th ACM International
Conference on Research and Development in
Information Retrieval (Seattle, WA, 1995), 246–
254.
LEWIS, D. D. 1995b. A sequential algorithm for
training text classifiers: corrigendum and addi-
tional data. SIGIR Forum 29, 2, 13–19.
LEWIS, D. D. 1995c. The TREC-4 filtering track:
description and analysis. In Proceedings
of TREC-4, 4th Text Retrieval Conference
(Gaithersburg, MD, 1995), 165–180.
LEWIS, D. D. 1998. Naive (Bayes) at forty: The
independence assumption in information re-
trieval. In Proceedings of ECML-98, 10th
European Conference on Machine Learning
(Chemnitz, Germany, 1998), 4–15.
LEWIS, D. D. AND CATLETT, J. 1994. Heterogeneous
uncertainty sampling for supervised learning. In
Proceedings of ICML-94, 11th International Con-
ference on Machine Learning (New Brunswick,
NJ, 1994), 148–156.
LEWIS, D. D. AND GALE, W. A. 1994. A sequential
algorithm for training text classifiers. In Pro-
ceedings of SIGIR-94, 17th ACM International
Conference on Research and Development in
Information Retrieval (Dublin, Ireland, 1994),
3–12. See also Lewis [1995b].
LEWIS, D. D. AND HAYES, P. J. 1994. Guest editorial
for the special issue on text categorization. ACM
Trans. Inform. Syst. 12, 3, 231.
LEWIS, D. D. AND RINGUETTE, M. 1994. A compar-
ison of two learning algorithms for text cat-
egorization. In Proceedings of SDAIR-94, 3rd
Annual Symposium on Document Analysis and
Information Retrieval (Las Vegas, NV, 1994),
81–93.
LEWIS, D. D., SCHAPIRE, R. E., CALLAN, J. P., AND PAPKA,
R. 1996. Training algorithms for linear text
classifiers. In Proceedings of SIGIR-96, 19th
ACM International Conference on Research and
Development in Information Retrieval (Zürich,
Switzerland, 1996), 298–306.
LI, H. AND YAMANISHI, K. 1999. Text classification
using ESC-based stochastic decision lists. In
Proceedings of CIKM-99, 8th ACM International
Conference on Information and Knowledge Man-
agement (Kansas City, MO, 1999), 122–130.
LI, Y. H. AND JAIN, A. K. 1998. Classification of text
documents. Comput. J. 41, 8, 537–546.
LIDDY, E. D., PAIK, W., AND YU, E. S. 1994. Text cat-
egorization for multiple users based on seman-
tic features from a machine-readable dictionary.
ACM Trans. Inform. Syst. 12, 3, 278–295.
LIERE, R. AND TADEPALLI, P. 1997. Active learning
with committees for text categorization. In Pro-
ceedings of AAAI-97, 14th Conference of the
American Association for Artificial Intelligence
(Providence, RI, 1997), 591–596.
LIM, J. H. 1999. Learnable visual keywords for im-
age classification. In Proceedings of DL-99, 4th
ACM Conference on Digital Libraries (Berkeley,
CA, 1999), 139–145.
MANNING, C. AND SCHÜTZE, H. 1999. Foundations of
Statistical Natural Language Processing. MIT
Press, Cambridge, MA.
MARON, M. 1961. Automatic indexing: an experi-
mental inquiry. J. Assoc. Comput. Mach. 8, 3,
404–417.
MASAND, B. 1994. Optimising confidence of text
classification by evolution of symbolic expres-
sions. In Advances in Genetic Programming,
K. E. Kinnear, ed. MIT Press, Cambridge, MA,
Chapter 21, 459–476.
MASAND, B., LINOFF, G., AND WALTZ, D. 1992. Clas-
sifying news stories using memory-based rea-
soning. In Proceedings of SIGIR-92, 15th ACM
International Conference on Research and Devel-
opment in Information Retrieval (Copenhagen,
Denmark, 1992), 59–65.
MCCALLUM, A. K. AND NIGAM, K. 1998. Employ-
ing EM in pool-based active learning for text
classification. In Proceedings of ICML-98, 15th
International Conference on Machine Learning
(Madison, WI, 1998), 350–358.
MCCALLUM, A. K., ROSENFELD, R., MITCHELL, T. M., AND
NG, A. Y. 1998. Improving text classification
by shrinkage in a hierarchy of classes. In Pro-
ceedings of ICML-98, 15th International Confer-
ence on Machine Learning (Madison, WI, 1998),
359–367.
MERKL, D. 1998. Text classification with self-
organizing maps: Some lessons learned. Neuro-
computing 21, 1/3, 61–77.
MITCHELL, T. M. 1996. Machine Learning. McGraw
Hill, New York, NY.
MLADENIĆ, D. 1998. Feature subset selection in
text learning. In Proceedings of ECML-98,
10th European Conference on Machine Learning
(Chemnitz, Germany, 1998), 95–100.
MLADENIĆ, D. AND GROBELNIK, M. 1998. Word se-
quences as features in text-learning. In Pro-
ceedings of ERK-98, the Seventh Electrotechni-
cal and Computer Science Conference (Ljubljana,
Slovenia, 1998), 145–148.
MOULINIER, I. AND GANASCIA, J.-G. 1996. Applying
an existing machine learning algorithm to text
categorization. In Connectionist, Statistical,
and Symbolic Approaches to Learning for Nat-
ural Language Processing, S. Wermter, E. Riloff,
and G. Scheler, eds. Springer Verlag, Heidelberg,
Germany, 343–354.
MOULINIER, I., RAŠKINIS, G., AND GANASCIA, J.-G. 1996.
Text categorization: a symbolic approach. In
Proceedings of SDAIR-96, 5th Annual Sympo-
sium on Document Analysis and Information
Retrieval (Las Vegas, NV, 1996), 87–99.
MYERS, K., KEARNS, M., SINGH, S., AND WALKER,
M. A. 2000. A boosting approach to topic
spotting on subdialogues. In Proceedings of
ICML-00, 17th International Conference on Ma-
chine Learning (Stanford, CA, 2000), 655–
662.
NG, H. T., GOH, W. B., AND LOW, K. L. 1997. Fea-
ture selection, perceptron learning, and a us-
ability case study for text categorization. In Pro-
ceedings of SIGIR-97, 20th ACM International
Conference on Research and Development in
Information Retrieval (Philadelphia, PA, 1997),
67–73.
NIGAM, K., MCCALLUM, A. K., THRUN, S., AND MITCHELL,
T. M. 2000. Text classification from labeled
and unlabeled documents using EM. Mach.
Learn. 39, 2/3, 103–134.
OH, H.-J., MYAENG, S. H., AND LEE, M.-H. 2000. A
practical hypertext categorization method using
links and incrementally available class informa-
tion. In Proceedings of SIGIR-00, 23rd ACM In-
ternational Conference on Research and Develop-
ment in Information Retrieval (Athens, Greece,
2000), 264–271.
PAZIENZA, M. T., ed. 1997. Information Extraction.
Lecture Notes in Computer Science, Vol. 1299.
Springer, Heidelberg, Germany.
RILOFF, E. 1995. Little words can make a big dif-
ference for text classification. In Proceedings of
SIGIR-95, 18th ACM International Conference
on Research and Development in Information
Retrieval (Seattle, WA, 1995), 130–136.
RILOFF, E. AND LEHNERT, W. 1994. Information ex-
traction as a basis for high-precision text classifi-
cation. ACM Trans. Inform. Syst. 12, 3, 296–333.
ROBERTSON, S. E. AND HARDING, P. 1984. Probabilis-
tic automatic indexing by learning from human
indexers. J. Document. 40, 4, 264–270.
ROBERTSON, S. E. AND SPARCK JONES, K. 1976. Rel-
evance weighting of search terms. J. Amer. Soc.
Inform. Sci. 27, 3, 129–146. Also reprinted in
Willett [1988], pp. 143–160.
ROTH, D. 1998. Learning to resolve natural
language ambiguities: a unified approach. In
Proceedings of AAAI-98, 15th Conference of the
American Association for Artificial Intelligence
(Madison, WI, 1998), 806–813.
RUIZ, M. E. AND SRINIVASAN, P. 1999. Hierarchical
neural networks for text categorization. In Pro-
ceedings of SIGIR-99, 22nd ACM International
Conference on Research and Development in
Information Retrieval (Berkeley, CA, 1999),
281–282.
SABLE, C. L. AND HATZIVASSILOGLOU, V. 2000. Text-
based approaches for non-topical image catego-
rization. Internat. J. Dig. Libr. 3, 3, 261–275.
SALTON, G. AND BUCKLEY, C. 1988. Term-weighting
approaches in automatic text retrieval. Inform.
Process. Man. 24, 5, 513–523. Also reprinted in
Sparck Jones and Willett [1997], pp. 323–328.
SALTON, G., WONG, A., AND YANG, C. 1975. A vector
space model for automatic indexing. Commun.
ACM 18, 11, 613–620. Also reprinted in Sparck
Jones and Willett [1997], pp. 273–280.
SARACEVIC, T. 1975. Relevance: a review of and
a framework for the thinking on the notion in
information science. J. Amer. Soc. Inform. Sci.
26, 6, 321–343. Also reprinted in Sparck Jones
and Willett [1997], pp. 143–165.
SCHAPIRE, R. E. AND SINGER, Y. 2000. BoosTexter:
a boosting-based system for text categorization.
Mach. Learn. 39, 2/3, 135–168.
SCHAPIRE, R. E., SINGER, Y., AND SINGHAL, A. 1998.
Boosting and Rocchio applied to text filtering.
In Proceedings of SIGIR-98, 21st ACM Interna-
tional Conference on Research and Development
in Information Retrieval (Melbourne, Australia,
1998), 215–223.
SCHÜTZE, H. 1998. Automatic word sense discrimina-
tion. Computat. Ling. 24, 1, 97–124.
SCHÜTZE, H., HULL, D. A., AND PEDERSEN, J. O. 1995.
A comparison of classifiers and document repre-
sentations for the routing problem. In Proceed-
ings of SIGIR-95, 18th ACM International Con-
ference on Research and Development in Infor-
mation Retrieval (Seattle, WA, 1995), 229–237.
SCOTT, S. AND MATWIN, S. 1999. Feature engineer-
ing for text classification. In Proceedings of
ICML-99, 16th International Conference on Ma-
chine Learning (Bled, Slovenia, 1999), 379–388.
SEBASTIANI, F., SPERDUTI, A., AND VALDAMBRINI, N.
2000. An improved boosting algorithm and its
application to automated text categorization. In
Proceedings of CIKM-00, 9th ACM International
Conference on Information and Knowledge
Management (McLean, VA, 2000), 78–85.
SINGHAL, A., MITRA, M., AND BUCKLEY, C. 1997.
Learning routing queries in a query zone. In
Proceedings of SIGIR-97, 20th ACM Interna-
tional Conference on Research and Development
in Information Retrieval (Philadelphia, PA,
1997), 25–32.
SINGHAL, A., SALTON, G., MITRA, M., AND BUCKLEY,
C. 1996. Document length normalization.
Inform. Process. Man. 32, 5, 619–633.
SLONIM, N. AND TISHBY, N. 2001. The power of word
clusters for text classification. In Proceedings
of ECIR-01, 23rd European Colloquium on
Information Retrieval Research (Darmstadt,
Germany, 2001).
SPARCK JONES, K. AND WILLETT, P., eds. 1997.
Readings in Information Retrieval. Morgan
Kaufmann, San Mateo, CA.
TAIRA, H. AND HARUNO, M. 1999. Feature selection
in SVM text categorization. In Proceedings
of AAAI-99, 16th Conference of the American
Association for Artificial Intelligence (Orlando,
FL, 1999), 480–486.
TAURITZ, D. R., KOK, J. N., AND SPRINKHUIZEN-KUYPER,
I. G. 2000. Adaptive information filtering
using evolutionary computation. Inform. Sci.
122, 2–4, 121–140.
TUMER, K. AND GHOSH, J. 1996. Error correlation
and error reduction in ensemble classifiers.
Connection Sci. 8, 3–4, 385–403.
TZERAS, K. AND HARTMANN, S. 1993. Automatic
indexing based on Bayesian inference networks.
In Proceedings of SIGIR-93, 16th ACM Interna-
tional Conference on Research and Development
in Information Retrieval (Pittsburgh, PA, 1993),
22–34.
VAN RIJSBERGEN, C. J. 1977. A theoretical basis for
the use of co-occurrence data in information
retrieval. J. Document. 33, 2, 106–119.
VAN RIJSBERGEN, C. J. 1979. Information Retrieval,
2nd ed. Butterworths, London, UK. Available at
http://www.dcs.gla.ac.uk/Keith.
WEIGEND, A. S., WIENER, E. D., AND PEDERSEN, J. O.
1999. Exploiting hierarchy in text categoriza-
tion. Inform. Retr. 1, 3, 193–216.
WEISS, S. M., APTÉ, C., DAMERAU, F. J., JOHNSON, D.
E., OLES, F. J., GOETZ, T., AND HAMPP, T. 1999.
Maximizing text-mining performance. IEEE
Intell. Syst. 14, 4, 63–69.
WIENER, E. D., PEDERSEN, J. O., AND WEIGEND, A. S.
1995. A neural network approach to topic spot-
ting. In Proceedings of SDAIR-95, 4th Annual
Symposium on Document Analysis and Informa-
tion Retrieval (Las Vegas, NV, 1995), 317–332.
WILLETT, P., ed. 1988. Document Retrieval Sys-
tems. Taylor Graham, London, UK.
WONG, J. W., KAN, W.-K., AND YOUNG, G. H. 1996.
ACTION: automatic classification for full-text
documents. SIGIR Forum 30, 1, 26–41.
YANG, Y. 1994. Expert network: effective and
efficient learning from human decisions in text
categorisation and retrieval. In Proceedings of
SIGIR-94, 17th ACM International Conference
on Research and Development in Information
Retrieval (Dublin, Ireland, 1994), 13–22.
YANG, Y. 1995. Noise reduction in a statistical ap-
proach to text categorization. In Proceedings of
SIGIR-95, 18th ACM International Conference
on Research and Development in Information
Retrieval (Seattle, WA, 1995), 256–263.
YANG, Y. 1999. An evaluation of statistical ap-
proaches to text categorization. Inform. Retr. 1,
1–2, 69–90.
YANG, Y. AND CHUTE, C. G. 1994. An example-based
mapping method for text categorization and re-
trieval. ACM Trans. Inform. Syst. 12, 3, 252–277.
YANG, Y. AND LIU, X. 1999. A re-examination of
text categorization methods. In Proceedings of
SIGIR-99, 22nd ACM International Conference
on Research and Development in Information
Retrieval (Berkeley, CA, 1999), 42–49.
YANG, Y. AND PEDERSEN, J. O. 1997. A comparative
study on feature selection in text categorization.
In Proceedings of ICML-97, 14th International
Conference on Machine Learning (Nashville,
TN, 1997), 412–420.
YANG, Y., SLATTERY, S., AND GHANI, R. 2002. A study
of approaches to hypertext categorization. J. In-
tell. Inform. Syst. 18, 2/3 (March–May), 219–241.
YU, K. L. AND LAM, W. 1998. A new on-line learn-
ing algorithm for adaptive text filtering. In
Proceedings of CIKM-98, 7th ACM International
Conference on Information and Knowledge
Management (Bethesda, MD, 1998), 156–160.
Received December 1999; revised February 2001; accepted July 2001

Until the late ’80s the most popular approach to TC, at least in the “operational” (i.e., real-world applications) community, was a knowledge engineering (KE) one, consisting in manually defining a set of rules encoding expert knowledge on how to classify documents under the given categories. In the ’90s this approach has increasingly lost popularity (especially in the research community) in favor of the machine learning (ML) paradigm, according to which a general inductive process automatically builds an automatic text classifier by learning, from a set of preclassified documents, the characteristics of the categories of interest. The advantages of this approach are an accuracy comparable to that achieved by human experts, and a considerable savings in terms of expert labor power, since no intervention from either knowledge engineers or domain experts is needed for the construction of the classifier or for its porting to a different set of categories. It is the ML approach to TC that this paper concentrates on.

Current-day TC is thus a discipline at the crossroads of ML and IR, and as such it shares a number of characteristics with other tasks such as information/knowledge extraction from texts and text mining [Knight 1999; Pazienza 1997]. There is still considerable debate on where the exact border between these disciplines lies, and the terminology is still evolving. “Text mining” is increasingly being used to denote all the tasks that, by analyzing large quantities of text and detecting usage patterns, try to extract probably useful (although only probably correct) information. According to this view, TC is an instance of text mining.

TC enjoys quite a rich literature now, but this is still fairly scattered.1 Although two international journals have devoted special issues to this topic [Joachims and Sebastiani 2002; Lewis and Hayes 1994], there are no systematic treatments of the subject: there are neither textbooks nor journals entirely devoted to TC yet, and Manning and Schütze [1999, Chapter 16] is the only chapter-length treatment of the subject.

As a note, we should warn the reader that the term “automatic text classification” has sometimes been used in the literature to mean things quite different from the ones discussed here. Aside from (i) the automatic assignment of documents to a predefined set of categories, which is the main topic of this paper, the term has also been used to mean (ii) the automatic identification of such a set of categories (e.g., Borko and Bernick [1963]), or (iii) the automatic identification of such a set of categories and the grouping of documents under them (e.g., Merkl [1998]), a task usually called text clustering, or (iv) any activity of placing text items into groups, a task that has thus both TC and text clustering as particular instances [Manning and Schütze 1999].

This paper is organized as follows. In Section 2 we formally define TC and its various subcases, and in Section 3 we review its most important applications. Section 4 describes the main ideas underlying the ML approach to classification. Our discussion of text classification starts in Section 5 by introducing text indexing, that is, the transformation of textual documents into a form that can be interpreted by a classifier-building algorithm and by the classifier eventually built by it. Section 6 tackles the inductive construction of a text classifier from a “training” set of preclassified documents. Section 7 discusses the evaluation of text classifiers. Section 8 concludes, discussing open issues and possible avenues of further research for TC.

1 A fully searchable bibliography on TC created and maintained by this author is available at http://liinwww.ira.uka.de/bibliography/Ai/automated.text.categorization.html.

2. TEXT CATEGORIZATION

2.1. A Definition of Text Categorization

Text categorization is the task of assigning a Boolean value to each pair ⟨dj, ci⟩ ∈ D × C, where D is a domain of documents and C = {c1, ..., c|C|} is a set of predefined categories. A value of T assigned to ⟨dj, ci⟩ indicates a decision to file dj under ci, while a value of F indicates a decision not to file dj under ci. More formally, the task is to approximate the unknown target function Φ̆ : D × C → {T, F} (that describes how documents ought to be classified) by means of a function Φ : D × C → {T, F} called the classifier (a.k.a. rule, or hypothesis, or model) such that Φ̆ and Φ “coincide as much as possible.” How to precisely define and measure this coincidence (called effectiveness) will be discussed in Section 7.1. From now on we will assume that:

—The categories are just symbolic labels, and no additional knowledge (of a procedural or declarative nature) of their meaning is available.

—No exogenous knowledge (i.e., data provided for classification purposes by an external source) is available; therefore, classification must be accomplished on the basis of endogenous knowledge only (i.e., knowledge extracted from the documents). In particular, this means that metadata such as, for example, publication date, document type, publication source, etc., is not assumed to be available.

The TC methods we will discuss are thus completely general, and do not depend on the availability of special-purpose resources that might be unavailable or costly to develop. Of course, these assumptions need not be verified in operational settings, where it is legitimate to use any source of information that might be available or deemed worth developing [Díaz Esteban et al. 1998; Junker and Abecker 1997].

Relying only on endogenous knowledge means classifying a document based solely on its semantics, and given that the semantics of a document is a subjective notion, it follows that the membership of a document in a category (pretty much as the relevance of a document to an information need in IR [Saracevic 1975]) cannot be decided deterministically. This is exemplified by the phenomenon of inter-indexer inconsistency [Cleverdon 1984]: when two human experts decide whether to classify document dj under category ci, they may disagree, and this in fact happens with relatively high frequency. A news article on Clinton attending Dizzy Gillespie’s funeral could be filed under Politics, or under Jazz, or under both, or even under neither, depending on the subjective judgment of the expert.
2.2. Single-Label Versus Multilabel Text Categorization

Different constraints may be enforced on the TC task, depending on the application. For instance we might need that, for a given integer k, exactly k (or ≤ k, or ≥ k) elements of C be assigned to each dj ∈ D. The case in which exactly one category must be assigned to each dj ∈ D is often called the single-label (a.k.a. nonoverlapping categories) case, while the case in which any number of categories from 0 to |C| may be assigned to the same dj ∈ D is dubbed the multilabel (a.k.a. overlapping categories) case. A special case of single-label TC is binary TC, in which each dj ∈ D must be assigned either to category ci or to its complement c̄i.

From a theoretical point of view, the binary case (hence, the single-label case, too) is more general than the multilabel, since an algorithm for binary classification can also be used for multilabel classification: one needs only transform the problem of multilabel classification under {c1, ..., c|C|} into |C| independent problems of binary classification under {ci, c̄i}, for i = 1, ..., |C|. However, this requires that categories be stochastically independent of each other, that is, for any c′, c″, the value of Φ̆(dj, c′) does not depend on the value of Φ̆(dj, c″) and vice versa; this is usually assumed to be the case (applications in which this is not the case are discussed in Section 3.5). The converse is not true: an algorithm for multilabel classification cannot be used for either binary or single-label classification. In fact, given a document dj to classify, (i) the classifier might attribute k > 1 categories to dj, and it might not be obvious how to choose a “most appropriate” category from them; or (ii) the classifier might attribute to dj no category at all, and it might not be obvious how to choose a “least inappropriate” category from C.

In the rest of the paper, unless explicitly mentioned, we will deal with the binary case. There are various reasons for this:

—The binary case is important in itself because important TC applications, including filtering (see Section 3.3), consist of binary classification problems (e.g., deciding whether dj is about Jazz or not). In TC, most binary classification problems feature unevenly populated categories (e.g., much fewer documents are about Jazz than are not) and unevenly characterized categories (e.g., what is about Jazz can be characterized much better than what is not).

—Solving the binary case also means solving the multilabel case, which is also representative of important TC applications, including automated indexing for Boolean systems (see Section 3.1).

—Most of the TC literature is couched in terms of the binary case.

—Most techniques for binary classification are just special cases of existing techniques for the single-label case, and are simpler to illustrate than these latter.

This ultimately means that we will view classification under C = {c1, ..., c|C|} as consisting of |C| independent problems of classifying the documents in D under a given category ci, for i = 1, ..., |C|. A classifier for ci is then a function Φi : D → {T, F} that approximates an unknown target function Φ̆i : D → {T, F}.
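Since we will rely on this decomposition throughout, a minimal sketch may help. The following illustration (ours, not from any system discussed in this survey) reduces a multilabel problem under C to |C| independent binary problems; learner_factory stands for any inductive method exposing fit/predict methods, and all names are hypothetical:

# Minimal sketch (illustrative): reducing multilabel TC under
# C = {c1, ..., c|C|} to |C| independent binary problems {ci, not-ci}.

def train_binary_classifiers(learner_factory, documents, labels, categories):
    """labels[j] is the set of categories the experts assigned to documents[j]."""
    classifiers = {}
    for c in categories:
        # Positive examples are the documents filed under c;
        # every other document acts as a negative example of c.
        y = [c in doc_labels for doc_labels in labels]
        clf = learner_factory()
        clf.fit(documents, y)   # builds the classifier for category c
        classifiers[c] = clf
    return classifiers

def classify(classifiers, document):
    """The multilabel decision is the union of the |C| binary decisions."""
    return {c for c, clf in classifiers.items() if clf.predict(document)}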
2.3. Category-Pivoted Versus Document-Pivoted Text Categorization

There are two different ways of using a text classifier. Given dj ∈ D, we might want to find all the ci ∈ C under which it should be filed (document-pivoted categorization—DPC); alternatively, given ci ∈ C, we might want to find all the dj ∈ D that should be filed under it (category-pivoted categorization—CPC). This distinction is more pragmatic than conceptual, but is important since the sets C and D might not be available in their entirety right from the start. It is also relevant to the choice of the classifier-building method, as some of these methods (see Section 6.9) allow the construction of classifiers with a definite slant toward one or the other style.

DPC is thus suitable when documents become available at different moments in time, e.g., in filtering e-mail. CPC is instead suitable when (i) a new category c|C|+1 may be added to an existing set C = {c1, ..., c|C|} after a number of documents have already been classified under C, and (ii) these documents need to be reconsidered for classification under c|C|+1 (e.g., Larkey [1999]). DPC is used more often than CPC, as the former situation is more common than the latter.

Although some specific techniques apply to one style and not to the other (e.g., the proportional thresholding method discussed in Section 6.1 applies only to CPC), this is more the exception than the rule: most of the techniques we will discuss allow the construction of classifiers capable of working in either mode.

2.4. “Hard” Categorization Versus Ranking Categorization

While a complete automation of the TC task requires a T or F decision for each pair ⟨dj, ci⟩, a partial automation of this process might have different requirements. For instance, given dj ∈ D a system might simply rank the categories in C = {c1, ..., c|C|} according to their estimated appropriateness to dj, without taking any “hard” decision on any of them. Such a ranked list would be of great help to a human expert in charge of taking the final categorization decision, since she could thus restrict the choice to the category (or categories) at the top of the list, rather than having to examine the entire set. Alternatively, given ci ∈ C a system might simply rank the documents in D according to their estimated appropriateness to ci; symmetrically, for classification under ci a human expert would just examine the top-ranked documents instead of the entire document set. These two modalities are sometimes called category-ranking TC and document-ranking TC [Yang 1999], and are the obvious counterparts of DPC and CPC, respectively.

Semiautomated, “interactive” classification systems [Larkey and Croft 1996] are useful especially in critical applications in which the effectiveness of a fully automated system may be expected to be significantly lower than that of a human expert. This may be the case when the quality of the training data (see Section 4) is low, or when the training documents cannot be trusted to be a representative sample of the unseen documents that are to come, so that the results of a completely automatic classifier could not be trusted completely.

In the rest of the paper, unless explicitly mentioned, we will deal with “hard” classification; however, many of the algorithms we will discuss naturally lend themselves to ranking TC too (more details on this in Section 6.1).

3. APPLICATIONS OF TEXT CATEGORIZATION

TC goes back to Maron’s [1961] seminal work on probabilistic text classification. Since then, it has been used for a number of different applications, of which we here briefly review the most important ones. Note that the borders between the different classes of applications listed here are fuzzy and somehow artificial, and some of these may be considered special cases of others. Other applications we do not explicitly discuss are speech categorization by means of a combination of speech recognition and TC [Myers et al. 2000; Schapire and Singer 2000], multimedia document categorization through the analysis of textual captions [Sable and Hatzivassiloglou 2000], author identification for literary texts of unknown or disputed authorship [Forsyth 1999], language identification for texts of unknown language [Cavnar and Trenkle 1994], automated identification of text genre [Kessler et al. 1997], and automated essay grading [Larkey 1998].

3.1. Automatic Indexing for Boolean Information Retrieval Systems

The application that has spawned most of the early research in the field [Borko and Bernick 1963; Field 1975; Gray and Harley 1971; Heaps 1973; Maron 1961] is that of automatic document indexing for IR systems relying on a controlled dictionary, the most prominent example of which is Boolean systems. In these latter each document is assigned one or more key words or key phrases describing its content, where these key words and key phrases belong to a finite set called controlled dictionary, often consisting of a thematic hierarchical thesaurus (e.g., the NASA thesaurus for the aerospace discipline, or the MESH thesaurus for medicine). Usually, this assignment is done by trained human indexers, and is thus a costly activity.

If the entries in the controlled vocabulary are viewed as categories, text indexing is an instance of TC, and may thus be addressed by the automatic techniques described in this paper. Recalling Section 2.2, note that this application may typically require that k1 ≤ x ≤ k2 key words are assigned to each document, for given k1, k2. Document-pivoted TC is probably the best option, so that new documents may be classified as they become available. Various text classifiers explicitly conceived for document indexing have been described in the literature; see, for example, Fuhr and Knorz [1984], Robertson and Harding [1984], and Tzeras and Hartmann [1993].

Automatic indexing with controlled dictionaries is closely related to automated metadata generation. In digital libraries, one is usually interested in tagging documents by metadata that describes them under a variety of aspects (e.g., creation date, document type or format, availability, etc.). Some of this metadata is thematic, that is, its role is to describe the semantics of the document by means of bibliographic codes, key words or key phrases. The generation of this metadata may thus be viewed as a problem of document indexing with controlled dictionary, and thus tackled by means of TC techniques.

3.2. Document Organization

Indexing with a controlled vocabulary is an instance of the general problem of document base organization. In general, many other issues pertaining to document organization and filing, be it for purposes of personal organization or structuring of a corporate document base, may be addressed by TC techniques. For instance, at the offices of a newspaper incoming “classified” ads must be, prior to publication, categorized under categories such as Personals, Cars for Sale, Real Estate, etc. Newspapers dealing with a high volume of classified ads would benefit from an automatic system that chooses the most suitable category for a given ad. Other possible applications are the organization of patents into categories for making their search easier [Larkey 1999], the automatic filing of newspaper articles under the appropriate sections (e.g., Politics, Home News, Lifestyles, etc.), or the automatic grouping of conference papers into sessions.

3.3. Text Filtering

Text filtering is the activity of classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer [Belkin and Croft 1992]. A typical case is a newsfeed, where the producer is a news agency and the consumer is a newspaper [Hayes et al. 1990]. In this case, the filtering system should block the delivery of the documents the consumer is likely not interested in (e.g., all news not concerning sports, in the case of a sports newspaper). Filtering can be seen as a case of single-label TC, that is, the classification of incoming documents into two disjoint categories, the relevant and the irrelevant. Additionally, a filtering system may also further classify the documents deemed relevant to the consumer into thematic categories; in the example above, all articles about sports should be further classified according to which sport they deal with, so as to allow journalists specialized in individual sports to access only documents of prospective interest for them. Similarly, an e-mail filter might be trained to discard “junk” mail [Androutsopoulos et al. 2000; Drucker et al. 1999] and further classify nonjunk mail into topical categories of interest to the user.

A filtering system may be installed at the producer end, in which case it must route the documents to the interested consumers only, or at the consumer end, in which case it must block the delivery of documents deemed uninteresting to the consumer. In the former case, the system builds and updates a “profile” for each consumer [Liddy et al. 1994], while in the latter case (which is the more common, and to which we will refer in the rest of this section) a single profile is needed.

A profile may be initially specified by the user, thereby resembling a standing IR query, and is updated by the system by using feedback information provided (either implicitly or explicitly) by the user on the relevance or nonrelevance of the delivered messages. In the TREC community [Lewis 1995c], this is called adaptive filtering, while the case in which no user-specified profile is available is called either routing or batch filtering, depending on whether documents have to be ranked in decreasing order of estimated relevance or just accepted/rejected. Batch filtering thus coincides with single-label TC under |C| = 2 categories; since this latter is a completely general TC task, some authors [Hull 1994; Hull et al. 1996; Schapire et al. 1998; Schütze et al. 1995], somewhat confusingly, use the term “filtering” in place of the more appropriate term “categorization.”

In information science, document filtering has a tradition dating back to the ’60s, when, addressed by systems of various degrees of automation and dealing with the multiconsumer case discussed above, it was called selective dissemination of information or current awareness (see Korfhage [1997, Chapter 6]). The explosion in the availability of digital information has boosted the importance of such systems, which are nowadays being used in contexts such as the creation of personalized Web newspapers, junk e-mail blocking, and Usenet news selection.

Information filtering by ML techniques is widely discussed in the literature: see Amati and Crestani [1999], Iyer et al. [2000], Kim et al. [2000], Tauritz et al. [2000], and Yu and Lam [1998].

3.4. Word Sense Disambiguation

Word sense disambiguation (WSD) is the activity of finding, given the occurrence in a text of an ambiguous (i.e., polysemous or homonymous) word, the sense of this particular word occurrence. For instance, bank may have (at least) two different senses in English, as in the Bank of England (a financial institution) or the bank of river Thames (a hydraulic engineering artifact). It is thus a WSD task to decide which of the above senses the occurrence of bank in Last week I borrowed some money from the bank has. WSD is very important for many applications, including natural language processing, and indexing documents by word senses rather than by words for IR purposes. WSD may be seen as a TC task (see Gale et al. [1993]; Escudero et al. [2000]) once we view word occurrence contexts as documents and word senses as categories. Quite obviously, this is a single-label TC case, and one in which document-pivoted TC is usually the right choice.

WSD is just an example of the more general issue of resolving natural language ambiguities, one of the most important problems in computational linguistics. Other examples, which may all be tackled by means of TC techniques along the lines discussed for WSD, are context-sensitive spelling correction, prepositional phrase attachment, part of speech tagging, and word choice selection in machine translation; see Roth [1998] for an introduction.

3.5. Hierarchical Categorization of Web Pages

TC has recently aroused a lot of interest also for its possible application to automatically classifying Web pages, or sites, under the hierarchical catalogues hosted by popular Internet portals. When Web documents are catalogued in this way, rather than issuing a query to a general-purpose Web search engine a searcher may find it easier to first navigate in the hierarchy of categories and then restrict her search to a particular category of interest. Classifying Web pages automatically has obvious advantages, since the manual categorization of a large enough subset of the Web is infeasible. Unlike in the previous applications, it is typically the case that each category must be populated by a set of k1 ≤ x ≤ k2 documents. CPC should be chosen so as to allow new categories to be added and obsolete ones to be deleted.

With respect to previously discussed TC applications, automatic Web page categorization has two essential peculiarities:

(1) The hypertextual nature of the documents: Links are a rich source of information, as they may be understood as stating the relevance of the linked page to the linking page. Techniques exploiting this intuition in a TC context have been presented by Attardi et al. [1998], Chakrabarti et al. [1998b], Fürnkranz [1999], Gövert et al. [1999], and Oh et al. [2000], and experimentally compared by Yang et al. [2002].

(2) The hierarchical structure of the category set: This may be used, for example, by decomposing the classification problem into a number of smaller classification problems, each corresponding to a branching decision at an internal node. Techniques exploiting this intuition in a TC context have been presented by Dumais and Chen [2000], Chakrabarti et al. [1998a], Koller and Sahami [1997], McCallum et al. [1998], Ruiz and Srinivasan [1999], and Weigend et al. [1999].

4. THE MACHINE LEARNING APPROACH TO TEXT CATEGORIZATION

In the ’80s, the most popular approach (at least in operational settings) for the creation of automatic document classifiers consisted in manually building, by means of knowledge engineering (KE) techniques, an expert system capable of taking TC decisions. Such an expert system would typically consist of a set of manually defined logical rules, one per category, of type

if ⟨DNF formula⟩ then ⟨category⟩.

A DNF (“disjunctive normal form”) formula is a disjunction of conjunctive clauses; the document is classified under ⟨category⟩ iff it satisfies the formula, that is, iff it satisfies at least one of the clauses. The most famous example of this approach is the CONSTRUE system [Hayes et al. 1990], built by Carnegie Group for the Reuters news agency. A sample rule of the type used in CONSTRUE is illustrated in Figure 1:

if ((wheat & farm) or
    (wheat & commodity) or
    (bushels & export) or
    (wheat & tonnes) or
    (wheat & winter & ¬soft))
then WHEAT
else ¬WHEAT

Fig. 1. Rule-based classifier for the WHEAT category; key words are indicated in italic, categories are indicated in SMALL CAPS (from Apté et al. [1994]).
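To make the semantics of such rules concrete, here is a minimal sketch (an illustration of DNF rule evaluation, not the actual CONSTRUE implementation) that evaluates a rule like the one in Figure 1 against the set of key words occurring in a document; the encoding of rules as lists of (term, required-presence) literals is our own:

# A DNF rule as a list of conjunctive clauses; each literal is a pair
# (term, required_presence), e.g. ("soft", False) encodes the negated key word.
WHEAT_RULE = [
    [("wheat", True), ("farm", True)],
    [("wheat", True), ("commodity", True)],
    [("bushels", True), ("export", True)],
    [("wheat", True), ("tonnes", True)],
    [("wheat", True), ("winter", True), ("soft", False)],
]

def satisfies(dnf_rule, terms):
    """True iff the document satisfies at least one conjunctive clause."""
    return any(all((t in terms) == present for t, present in clause)
               for clause in dnf_rule)

print(satisfies(WHEAT_RULE, {"wheat", "winter"}))          # True (last clause)
print(satisfies(WHEAT_RULE, {"wheat", "winter", "soft"}))  # False (¬soft fails)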
The drawback of this approach is the knowledge acquisition bottleneck well known from the expert systems literature. That is, the rules must be manually defined by a knowledge engineer with the aid of a domain expert (in this case, an expert in the membership of documents in the chosen set of categories): if the set of categories is updated, then these two professionals must intervene again, and if the classifier is ported to a completely different domain (i.e., set of categories), a different domain expert needs to intervene and the work has to be repeated from scratch.

On the other hand, it was originally suggested that this approach can give very good effectiveness results: Hayes et al. [1990] reported a .90 “breakeven” result (see Section 7) on a subset of the Reuters test collection, a figure that outperforms even the best classifiers built in the late ’90s by state-of-the-art ML techniques. However, no other classifier has been tested on the same dataset as CONSTRUE, and it is not clear whether this was a randomly chosen or a favorable subset of the entire Reuters collection. As argued by Yang [1999], the results above do not allow us to state that these effectiveness results may be obtained in general.

Since the early ’90s, the ML approach to TC has gained popularity and has eventually become the dominant one, at least in the research community (see Mitchell [1996] for a comprehensive introduction to ML). In this approach, a general inductive process (also called the learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or c̄i by a domain expert; from these characteristics, the inductive process gleans the characteristics that a new unseen document should have in order to be classified under ci. In ML terminology, the classification problem is an activity of supervised learning, since the learning process is “supervised” by the knowledge of the categories and of the training instances that belong to them.2

The advantages of the ML approach over the KE approach are evident. The engineering effort goes toward the construction not of a classifier, but of an automatic builder of classifiers (the learner). This means that if a learner is (as it often is) available off-the-shelf, all that is needed is the inductive, automatic construction of a classifier from a set of manually classified documents. The same happens if a classifier already exists and the original set of categories is updated, or if the classifier is ported to a completely different domain.

2 Within the area of content-based document management tasks, an example of an unsupervised learning activity is document clustering (see Section 1).

In the ML approach, the preclassified documents are then the key resource. In the most favorable case, they are already available; this typically happens for organizations which have previously carried out the same categorization activity manually and decide to automate the process. The less favorable case is when no manually classified documents are available; this typically happens for organizations that start a categorization activity and opt for an automated modality straightaway. The ML approach is more convenient than the KE approach also in this latter case. In fact, it is easier to manually classify a set of documents than to build and tune a set of rules, since it is easier to characterize a concept extensionally (i.e., to select instances of it) than intensionally (i.e., to describe the concept in words, or to describe a procedure for recognizing its instances).

Classifiers built by means of ML techniques nowadays achieve impressive levels of effectiveness (see Section 7), making automatic classification a qualitatively (and not only economically) viable alternative to manual classification.

4.1. Training Set, Test Set, and Validation Set

The ML approach relies on the availability of an initial corpus Ω = {d1, ..., d|Ω|} ⊂ D of documents preclassified under C = {c1, ..., c|C|}. That is, the values of the total function Φ̆ : D × C → {T, F} are known for every pair ⟨dj, ci⟩ ∈ Ω × C. A document dj is a positive example of ci if Φ̆(dj, ci) = T, a negative example of ci if Φ̆(dj, ci) = F.

In research settings (and in most operational settings too), once a classifier Φ has been built it is desirable to evaluate its effectiveness. In this case, prior to classifier construction the initial corpus is split in two sets, not necessarily of equal size:

—a training(-and-validation) set TV = {d1, ..., d|TV|}. The classifier Φ for categories C = {c1, ..., c|C|} is inductively built by observing the characteristics of these documents;

—a test set Te = {d|TV|+1, ..., d|Ω|}, used for testing the effectiveness of the classifiers. Each dj ∈ Te is fed to the classifier, and the classifier decisions Φ(dj, ci) are compared with the expert decisions Φ̆(dj, ci). A measure of classification effectiveness is based on how often the Φ(dj, ci) values match the Φ̆(dj, ci) values.

The documents in Te cannot participate in any way in the inductive construction of the classifiers; if this condition were not satisfied, the experimental results obtained would likely be unrealistically good, and the evaluation would thus have no scientific character [Mitchell 1996, page 129]. In an operational setting, after evaluation has been performed one would typically retrain the classifier on the entire initial corpus, in order to boost effectiveness. In this case, the results of the previous evaluation would be a pessimistic estimate of the real performance, since the final classifier has been trained on more data than the classifier evaluated.

This is called the train-and-test approach. An alternative is the k-fold cross-validation approach (see Mitchell [1996], page 146), in which k different classifiers Φ1, ..., Φk are built by partitioning the initial corpus into k disjoint sets Te1, ..., Tek and then iteratively applying the train-and-test approach on pairs ⟨TVi = Ω − Tei, Tei⟩. The final effectiveness figure is obtained by individually computing the effectiveness of Φ1, ..., Φk, and then averaging the individual results in some way.
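The two evaluation protocols just described are easy to pin down operationally. The sketch below (function names are ours; documents may be any Python objects) produces a train-and-test split and the k pairs ⟨TVi = Ω − Tei, Tei⟩ of k-fold cross-validation:

import random

def train_and_test_split(corpus, test_fraction=0.3, seed=0):
    """Split the initial corpus Omega into <TV, Te>."""
    docs = list(corpus)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * (1 - test_fraction))
    return docs[:cut], docs[cut:]  # TV (training-and-validation), Te (test)

def k_fold_pairs(corpus, k=10, seed=0):
    """Yield the k pairs <TV_i = Omega - Te_i, Te_i> of k-fold cross-validation."""
    docs = list(corpus)
    random.Random(seed).shuffle(docs)
    folds = [docs[i::k] for i in range(k)]  # k disjoint sets Te_1, ..., Te_k
    for i, te in enumerate(folds):
        tv = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield tv, te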

In both approaches, it is often the case that the internal parameters of the classifiers must be tuned by testing which values of the parameters yield the best effectiveness. In order to make this optimization possible, in the train-and-test approach the set {d1, ..., d|TV|} is further split into a training set Tr = {d1, ..., d|Tr|}, from which the classifier is built, and a validation set Va = {d|Tr|+1, ..., d|TV|} (sometimes called a hold-out set), on which the repeated tests of the classifier aimed at parameter optimization are performed; the obvious variant may be used in the k-fold cross-validation case. Note that, for the same reason why we do not test a classifier on the documents it has been trained on, we do not test it on the documents it has been optimized on: test set and validation set must be kept separate.3

3 From now on, we will take the freedom to use the expression “test document” to denote any document not in the training set and validation set. This includes thus any document submitted to the classifier in the operational phase.

Given a corpus Ω, one may define the generality gΩ(ci) of a category ci as the percentage of documents that belong to ci, that is:

gΩ(ci) = |{dj ∈ Ω | Φ̆(dj, ci) = T}| / |Ω|.

The training set generality gTr(ci), validation set generality gVa(ci), and test set generality gTe(ci) of ci may be defined in the obvious way.
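As an illustration (the data layout is hypothetical, not from the survey), the generality of a category can be computed directly from the expert labels:

def generality(labels, c):
    """g(c_i) over a corpus: the fraction of documents filed under c_i.
    labels[j] is the set of categories assigned to the j-th document."""
    return sum(c in doc_labels for doc_labels in labels) / len(labels)

# Usage on a toy corpus of four documents:
labels = [{"Wheat"}, {"Wheat", "Grain"}, {"Jazz"}, set()]
print(generality(labels, "Wheat"))  # 0.5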
4.2. Information Retrieval Techniques and Text Categorization

Text categorization heavily relies on the basic machinery of IR. The reason is that TC is a content-based document management task, and as such it shares many characteristics with other IR tasks such as text search. IR techniques are used in three phases of the text classifier life cycle: (1) IR-style indexing is always performed on the documents of the initial corpus and on those to be classified during the operational phase; (2) IR-style techniques (such as document-request matching, query reformulation, ...) are often used in the inductive construction of the classifiers; (3) IR-style evaluation of the effectiveness of the classifiers is performed. The various approaches to classification differ mostly for how they tackle (2), although in a few cases nonstandard approaches to (1) and (3) are also used. Indexing, induction, and evaluation are the themes of Sections 5, 6 and 7, respectively.

5. DOCUMENT INDEXING AND DIMENSIONALITY REDUCTION

5.1. Document Indexing

Texts cannot be directly interpreted by a classifier or by a classifier-building algorithm. Because of this, an indexing procedure that maps a text dj into a compact representation of its content needs to be uniformly applied to training, validation, and test documents. The choice of a representation for text depends on what one regards as the meaningful units of text (the problem of lexical semantics) and the meaningful natural language rules for the combination of these units (the problem of compositional semantics). Similarly to what happens in IR, in TC this latter problem is usually disregarded,4 and a text dj is usually represented as a vector of term weights dj = ⟨w1j, ..., w|T|j⟩, where T is the set of terms (sometimes called features) that occur at least once in at least one document of Tr, and 0 ≤ wkj ≤ 1 represents, loosely speaking, how much term tk contributes to the semantics of document dj. Differences among approaches are accounted for by (1) different ways to understand what a term is, and (2) different ways to compute term weights. A typical choice for (1) is to identify terms with words. This is often called either the set of words or the bag of words approach to document representation, depending on whether weights are binary or not.

4 An exception to this is represented by learning approaches based on hidden Markov models [Denoyer et al. 2001; Frasconi et al. 2002].

the phrase is not grammatically such. Most of the times. page 40]. binary weights may be used (1 denoting presence and 0 absence of the term in the document). the standard tfidf function is used (see Salton and Buckley [1988]). and for ease of exposition we will assume they always do. although indexing languages based on phrases have superior semantic qualities. where a “crude” syntactic method was complemented by a statistical filter (only those syntactic phrases that occurred at least three times in the positive examples of a category ci were retained). Schutze et al. 11 Lewis et al.1] interval and for the documents to be represented by vectors of equal length. [1996]). they have inferior statistical qualities with respect to word-only indexing languages: a phrase-only indexing language has “more terms. the less discriminating it is. the phrase is such according to a grammar of the language (see Lewis [1992a]). and lower document frequency for terms” [Lewis 1992a. the semantics of a document is reduced to the collective lexical semantics of the terms that occur in it. 1. 1995. thereby deeming of null importance the order in which the terms occur in the document and the syntactic role they play. or —statistically. and (ii) the more documents a term occurs in. Lewis [1992a] argued that the likely reason for the discouraging results is that. Mladeni´ and Grobelnik 1998]. In particular. irrespectively of whether the notion of “phrase” is motivated —syntactically. whether binary or nonbinary weights are used depends on the classifier learning algorithm used. (1) #Tr (tk ) where #(tk . [2001]). that differ from each other in terms of logarithms. Tzeras and Hartmann 1993]. As a special case. 34. rather than individual words. the more it is representative of its content. lower consistency of assignment (since synonymous terms are not assigned to the same documents). It is likely that the final word on the usefulness of phrase indexing in TC has still to be told. d j ) denotes the number of times tk occurs in d j . see Salton and Buckley [1988] and Singhal et al. Vol. although perhaps to a smaller degree.5 Note that this formula (as most other indexing formulae) weights the importance of a term to a document in terms of occurrence considerations only. weights usually range between 0 and 1 (an exception is ACM Computing Surveys. In order for the weights to fall in the [0. that is. they also apply to statistically motivated ones. No. Although his remarks are about syntactically motivated phrases. In the case of nonbinary indexing. March 2002. but is composed of a set/sequence of words whose patterns of contiguous occurrence in the collection are statistically significant (see Caropreso et al. Formula 1 is just one of the possible instances of this class.Machine Learning in Automated Text Categorization [Salton and Buckley 1988]. 2001. some authors have used phrases. the number of documents in Tr in which tk occurs. In other words. This function embodies the intuitions that (i) the more often a term occurs in a document. for determining the weight wkj of term tk in document d j any IR-style indexing technique that represents a document as a vector of weighted terms may be used. 1991. that is. that is. d j ) · log |Tr| . the weights resulting from tfidf are often 5 There exist many variants of tfidf. and investigations in this direction are still being actively pursued [Caropreso et al. . but the experimental results found to date have not been uniformly encouraging. 
A combination of the two approaches is probably the best way to go: Tzeras and Hartmann [1993] obtained significant improvements by using noun phrases obtained through a combination of syntactic and statistical criteria. more synonymous or nearly synonymous terms. defined as tfidf (tk . as indexing terms ¨ [Fuhr et al. normalization or other correction factors. d j ) = #(tk . and #Tr (tk ) denotes the document frequency of term tk . [1996] for variations on this theme. c As for issue (2). thereby disregarding the issue of compositional semantics (an exception are the representation techniques used for FOIL [Cohen 1995a] and SLEEPING EXPERTS [Cohen and Singer 1999]).
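To make formulas (1) and (2) concrete, the following is a minimal sketch in Python; the toy corpus Tr and all function and variable names are illustrative assumptions of ours, not part of the original formulation.

    import math

    Tr = [["stock", "market", "stock"], ["market", "news"], ["sports", "news"]]

    def doc_frequency(term, corpus):
        # #_Tr(t_k): number of training documents in which the term occurs
        return sum(1 for doc in corpus if term in doc)

    def tfidf(term, doc, corpus):
        # Formula (1): #(t_k, d_j) * log(|Tr| / #_Tr(t_k))
        tf = doc.count(term)
        df = doc_frequency(term, corpus)
        return 0.0 if df == 0 else tf * math.log(len(corpus) / df)

    def weight_vector(doc, vocabulary, corpus):
        # Formula (2): cosine normalization, so that all document vectors
        # have unit Euclidean length and weights fall in [0,1]
        raw = [tfidf(t, doc, corpus) for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw))
        return [w / norm if norm > 0 else 0.0 for w in raw]

    vocabulary = sorted({t for doc in Tr for t in doc})
    print(weight_vector(Tr[0], vocabulary, Tr))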

Although normalized tfidf is the most popular indexing function, other functions have also been used, including probabilistic techniques [Gövert et al. 1999] or techniques for indexing structured documents [Larkey and Croft 1996]. Functions different from tfidf are especially needed when Tr is not available in its entirety from the start and #_Tr(t_k) cannot thus be computed, as in adaptive filtering; in this case, approximations of tfidf are usually employed [Dagan et al. 1997, Section 4.3].

Before indexing, the removal of function words (i.e., topic-neutral words such as articles, prepositions, conjunctions, etc.) is almost always performed (exceptions include Lewis et al. [1996], Nigam et al. [2000], and Riloff [1995]).(6) Concerning stemming (i.e., grouping words that share the same morphological root), its suitability to TC is controversial. Although, similarly to the unsupervised term clustering (see Section 5.5.1) of which it is an instance, stemming has sometimes been reported to hurt effectiveness (e.g., Baker and McCallum [1998]), the recent tendency is to adopt it, as it reduces both the dimensionality of the term space (see Section 5.3) and the stochastic dependence between terms (see Section 6.2).

Depending on the application, either the full text of the document or selected parts of it are indexed. While the former option is the rule, exceptions exist. For instance, in a patent categorization application Larkey [1999] indexed only the title, the abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention. This approach was made possible by the fact that documents describing patents are structured. Similarly, when a document title is available, one can pay extra importance to the words it contains [Apté et al. 1994; Cohen and Singer 1999; Weiss et al. 1999]. When documents are flat, identifying the most relevant part of a document is instead a nonobvious task.

(6) One application of TC in which it would be inappropriate to remove function words is author identification for documents of disputed paternity. In fact, as noted in Manning and Schütze [1999, page 589], "it is often the 'little' words that give an author away (for example, the relative frequencies of words like because or though)."

5.2. The Darmstadt Indexing Approach

The AIR/X system [Fuhr et al. 1991] occupies a special place in the literature on indexing for TC. This system is the final result of the AIR project, one of the most important efforts in the history of TC: spanning a duration of more than 10 years [Knorz 1982; Tzeras and Hartmann 1993], it has produced a system operatively employed since 1985 in the classification of corpora of scientific literature of O(10^5) documents and O(10^4) categories, and has had important theoretical spin-offs in the field of probabilistic indexing [Fuhr 1989; Fuhr and Buckley 1991].(7)

The approach to indexing taken in AIR/X is known as the Darmstadt Indexing Approach (DIA) [Fuhr 1985]. Here, "indexing" is used in the sense of Section 3.1, that is, as using terms from a controlled vocabulary, and is thus a synonym of TC (the DIA was later extended to indexing with free terms [Fuhr and Buckley 1991]). The idea that underlies the DIA is the use of a much wider set of "features" than described in Section 5.1. All other approaches mentioned in this paper view terms, be they single words, stems, phrases, or (see Sections 5.5.1 and 5.5.2) combinations of any of these, as the dimensions of the learning space.

(7) The AIR/X system, its applications (including the AIR/PHYS system [Biebricher et al. 1988], an application of AIR/X to indexing physics literature), and its experiments have also been richly documented in a series of papers and doctoral theses written in German. The interested reader may consult Fuhr et al. [1991] for a detailed bibliography.

In contrast, the DIA considers properties (of terms, of documents, of categories, or of pairwise relationships among these) as basic dimensions of the learning space. Examples of these are

—properties of a term t_k: for example, the idf of t_k;
—properties of the relationship between a term t_k and a document d_j: for example, the tf of t_k in d_j, or the location (e.g., in the title, or in the abstract) of t_k within d_j;
—properties of a document d_j: for example, the length of d_j;
—properties of a category c_i: for example, the training set generality of c_i.

For each possible document-category pair, the values of these features are collected in a so-called relevance description vector rd(d_j, c_i); a minimal sketch of how such a vector might be assembled is given below. The size of this vector is determined by the number of properties considered, and is thus independent of specific terms, categories, or documents (for multivalued features, appropriate aggregation functions are applied in order to yield a single value to be included in rd(d_j, c_i)); in this way an abstraction from specific terms, categories, or documents is achieved.

The main advantage of this approach is the possibility of considering additional features that can hardly be accounted for in the usual term-based approaches, for example, the location of a term within a document, or the certainty with which a phrase was identified in a document. The term-category relationship is described by estimates, derived from the training set, of the probability P(c_i | t_k) that a document belongs to category c_i, given that it contains term t_k (the DIA association factor).(8) Relevance description vectors rd(d_j, c_i) are then the final representations that are used for the classification of document d_j under category c_i.

The essential ideas of the DIA—transforming the classification space by means of abstraction and using a more detailed text representation than the standard bag-of-words approach—have not been taken up by other researchers so far. For new TC applications dealing with structured documents or categorization of Web pages, these ideas may become of increasing importance.

(8) Association factors are called adhesion coefficients in many early papers on TC.
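The following sketch shows how a DIA-style relevance description vector might be assembled. The particular properties included (tf, idf, title occurrence, document length, category generality) and all names are illustrative assumptions of ours, not the actual AIR/X feature set.

    import math

    def rd(term_stats, doc, category, corpus_size):
        # A hypothetical relevance description vector rd(d_j, c_i) for one
        # (document, category) pair, aggregated over the terms of d_j that
        # are indicative of c_i. Each entry is one "property", so the
        # vector's size is fixed and independent of the specific terms.
        terms = [t for t in doc["terms"] if t in category["indicative_terms"]]
        tf = sum(doc["terms"].count(t) for t in terms)
        idf = sum(math.log(corpus_size / term_stats[t]["df"]) for t in terms)
        in_title = sum(1 for t in terms if t in doc["title_terms"])
        return [
            tf,                      # property of the term-document relationship
            idf,                     # property of the terms
            in_title,                # location of the terms within d_j
            len(doc["terms"]),       # property of the document: its length
            category["generality"],  # property of the category
        ]

A learner is then trained on such vectors rather than on bag-of-words vectors, which is what allows the DIA to abstract away from specific terms.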
5.3. Dimensionality Reduction

Unlike in text retrieval, in TC the high dimensionality of the term space (i.e., the large value of |T|) may be problematic. In fact, while typical algorithms used in text retrieval (such as cosine matching) can scale to high values of |T|, the same does not hold of many sophisticated learning algorithms used for classifier induction (e.g., the LLSF algorithm of Yang and Chute [1994]). Because of this, before classifier induction one often applies a pass of dimensionality reduction (DR), whose effect is to reduce the size of the vector space from |T| to |T'| << |T|; the set T' is called the reduced term set.

DR is also beneficial since it tends to reduce overfitting, that is, the phenomenon by which a classifier is tuned also to the contingent characteristics of the training data rather than just the constitutive characteristics of the categories. Classifiers that overfit the training data are good at reclassifying the data they have been trained on, but much worse at classifying previously unseen data. Experiments have shown that, in order to avoid overfitting, a number of training examples roughly proportional to the number of terms used is needed; Fuhr and Buckley [1991, page 235] have suggested that 50-100 training examples per term may be needed in TC tasks. This means that, if DR is performed, overfitting may be avoided even if a smaller number of training examples is used. However, in removing terms the risk is to remove potentially useful information on the meaning of the documents. It is then clear that, in order to obtain optimal (cost-)effectiveness, the reduction process must be performed with care. Various DR methods have been proposed, either from the information theory or from the linear algebra literature, and their relative merits have been tested by experimentally evaluating the variation in effectiveness that a given classifier undergoes after application of the function to the term space.

There are two distinct ways of viewing DR, depending on whether the task is performed locally (i.e., for each individual category) or globally:

—local DR: for each category c_i, a set T'_i of terms, with |T'_i| << |T|, is chosen for classification under c_i (see Apté et al. [1994]; Lewis and Ringuette [1994]; Li and Jain [1998]; Ng et al. [1997]; Sable and Hatzivassiloglou [2000]; Schütze et al. [1995]; Wiener et al. [1995]). This means that different subsets of d_j are used when working with the different categories. Typical values are 10 ≤ |T'_i| ≤ 50.
—global DR: a set T' of terms, with |T'| << |T|, is chosen for the classification under all categories C = {c_1, ..., c_|C|} (see Caropreso et al. [2001]; Mladenić [1998]; Yang [1999]; Yang and Pedersen [1997]).

This distinction usually does not impact on the choice of DR technique, since most such techniques can be used (and have been used) for local and global DR alike (supervised DR techniques—see Section 5.5.1—are exceptions to this rule). In the rest of this section, we will assume that the global approach is used, although everything we will say also applies to the local approach.

A second, orthogonal distinction may be drawn in terms of the nature of the resulting terms:

—DR by term selection: T' is a subset of T;
—DR by term extraction: the terms in T' are not of the same type as the terms in T (e.g., if the terms in T are words, the terms in T' may not be words at all), but are obtained by combinations or transformations of the original ones.

Unlike in the previous distinction, these two ways of doing DR are tackled by very different techniques; we will address them separately in the next two sections.

5.4. Dimensionality Reduction by Term Selection

Given a predetermined integer r, techniques for term selection (also called term space reduction—TSR) attempt to select, from the original set T, the set T' of terms (with |T'| = r << |T|) that, when used for document indexing, yields the highest effectiveness. Yang and Pedersen [1997] have shown that TSR may even result in a moderate (≤5%) increase in effectiveness, depending on the classifier, on the aggressivity |T|/|T'| of the reduction, and on the TSR technique used.

Moulinier et al. [1996] have used a so-called wrapper approach, that is, one in which T' is identified by means of the same learning method that will be used for building the classifier [John et al. 1994]. Starting from an initial term set, a new term set is generated by either adding or removing a term. When a new term set is generated, a classifier based on it is built and then tested on a validation set. The term set that results in the best effectiveness is chosen. This approach has the advantage of being tuned to the learning algorithm being used; moreover, if local DR is performed, different numbers of terms for the different categories may be chosen, depending on whether a category is or is not easily separable from the others. However, the sheer size of the space of different term sets makes this approach cost-prohibitive for standard TC applications. A computationally easier alternative is the filtering approach [John et al. 1994], that is, keeping the |T'| << |T| terms that receive the highest score according to a function that measures the "importance" of the term for the TC task. We will explore this solution in the rest of this section.

5.4.1. Document Frequency. A simple and effective global TSR function is the document frequency #_Tr(t_k) of a term t_k: only the terms that occur in the highest number of documents are retained. In a series of experiments, Yang and Pedersen [1997] have shown that with #_Tr(t_k) it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness (a reduction by a factor of 100 bringing about just a small loss).

This seems to indicate that the terms occurring most frequently in the collection are the most valuable for TC. As such, this would seem to contradict a well-known "law" of IR, according to which the terms with low-to-medium document frequency are the most informative ones [Salton and Buckley 1988]. But these two results do not contradict each other, since it is well known (see Salton et al. [1975]) that the large majority of the words occurring in a corpus have a very low document frequency; this means that, by reducing the term set by a factor of 10 using document frequency, only such words are removed, while the words from low-to-medium to high document frequency are preserved. Of course, stop words need to be removed in advance, lest only topic-neutral words be retained [Mladenić 1998].

Finally, note that a slightly more empirical form of TSR by document frequency is adopted by many authors, who remove all terms occurring in at most x training documents (popular values for x range from 1 to 3), either as the only form of DR [Maron 1961; Ittner et al. 1995] or before applying another, more sophisticated form [Dumais et al. 1998; Li and Jain 1998]. A variant of this policy is removing all terms that occur at most x times in the training set (e.g., Dagan et al. [1997]; Joachims [1997]), with popular values for x ranging from 1 (e.g., Baker and McCallum [1998]) to 5 (e.g., Apté et al. [1994]; Cohen [1995a]).
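As a concrete illustration, here is a minimal sketch of filtering-style TSR by document frequency; the reduction factor and the toy corpus are assumptions made for the example, and stop word removal is presumed to have already taken place.

    def reduce_by_document_frequency(corpus, factor=10):
        # Keep the |T|/factor terms with the highest document frequency #_Tr(t_k)
        df = {}
        for doc in corpus:
            for term in set(doc):
                df[term] = df.get(term, 0) + 1
        vocabulary = sorted(df, key=df.get, reverse=True)
        return set(vocabulary[: max(1, len(vocabulary) // factor)])

    Tr = [["stock", "market", "rally"], ["market", "news"], ["stock", "market"]]
    print(reduce_by_document_frequency(Tr, factor=3))   # {'market'}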

5.4.2. Other Information-Theoretic Term Selection Functions. Other, more sophisticated information-theoretic functions have been used in the literature, among them the DIA association factor [Fuhr et al. 1991], chi-square [Caropreso et al. 2001; Galavotti et al. 2000; Schütze et al. 1995; Sebastiani et al. 2000; Yang and Pedersen 1997; Yang and Liu 1999], NGL coefficient [Ng et al. 1997; Ruiz and Srinivasan 1999], information gain [Caropreso et al. 2001; Larkey 1998; Lewis 1992a; Lewis and Ringuette 1994; Mladenić 1998; Moulinier and Ganascia 1996; Yang and Pedersen 1997; Yang and Liu 1999], mutual information [Dumais et al. 1998; Lam et al. 1997; Larkey and Croft 1996; Lewis and Ringuette 1994; Li and Jain 1998; Moulinier et al. 1996; Ruiz and Srinivasan 1999; Taira and Haruno 1999; Yang and Pedersen 1997], odds ratio [Caropreso et al. 2001; Mladenić 1998; Ruiz and Srinivasan 1999], relevancy score [Wiener et al. 1995], and GSS coefficient [Galavotti et al. 2000]. The mathematical definitions of these measures are summarized for convenience in Table I.(9) Here, probabilities are interpreted on an event space of documents (e.g., P(t̄_k, c_i) denotes the probability that, for a random document x, term t_k does not occur in x and x belongs to category c_i), and are estimated by counting occurrences in the training set. All functions are specified "locally" to a specific category c_i; in order to assess the value of a term t_k in a "global," category-independent sense, either the sum f_sum(t_k) = Σ_{i=1..|C|} f(t_k, c_i), or the weighted sum f_wsum(t_k) = Σ_{i=1..|C|} P(c_i) f(t_k, c_i), or the maximum f_max(t_k) = max_{i=1..|C|} f(t_k, c_i) of its category-specific values f(t_k, c_i) is usually computed.

These functions try to capture the intuition that the best terms for c_i are the ones distributed most differently in the sets of positive and negative examples of c_i. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ² is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent t_k and c_i are. The terms t_k with the lowest value for χ²(t_k, c_i) are thus the most independent from c_i; since we are interested in the terms which are not, we select the terms for which χ²(t_k, c_i) is highest.

(9) For better uniformity, Table I views all the TSR functions of this section in terms of subjective probability. In some cases, such as χ²(t_k, c_i), this is slightly artificial, since this function is not usually viewed in probabilistic terms. The formulae refer to the "local" (i.e., category-specific) forms of the functions, which again is slightly artificial in some cases. Note that the NGL and GSS coefficients are here named after their authors, since they had originally been given names that might generate some confusion if used here.
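A minimal sketch of χ²-based TSR follows, computing the local χ²(t_k, c_i) of Table I from training-set counts and globalizing it with f_max; the counts and all names are illustrative.

    def chi_square(N, n_tc, n_t, n_c):
        # Local chi-square of Table I, from training-set counts:
        #   N    = |Tr|, n_t = docs containing t_k, n_c = docs in c_i,
        #   n_tc = docs containing t_k AND belonging to c_i.
        p_tc  = n_tc / N                      # P(t_k, c_i)
        p_tc_ = (n_t - n_tc) / N              # P(t_k, not-c_i)
        p_t_c = (n_c - n_tc) / N              # P(not-t_k, c_i)
        p__   = (N - n_t - n_c + n_tc) / N    # P(not-t_k, not-c_i)
        p_t, p_c = n_t / N, n_c / N
        denom = p_t * (1 - p_t) * p_c * (1 - p_c)
        return 0.0 if denom == 0 else N * (p_tc * p__ - p_tc_ * p_t_c) ** 2 / denom

    def chi_square_max(term_counts, N):
        # f_max globalization: the maximum of the category-specific values,
        # given one (n_tc, n_t, n_c) triple per category for the term
        return max(chi_square(N, n_tc, n_t, n_c)
                   for (n_tc, n_t, n_c) in term_counts)

Terms are then ranked by this score and the top |T'| retained, exactly as in the document frequency filter above.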

Table I. Main functions used for term space reduction purposes. (Information gain is also known as expected mutual information, and is used under this name by Lewis [1992a, page 44] and Larkey [1998]. In the RS(t_k, c_i) formula, d is a constant damping factor.)

Function                 Denoted by       Mathematical form
DIA association factor   z(t_k, c_i)      P(c_i | t_k)
Information gain         IG(t_k, c_i)     Σ_{c ∈ {c_i, c̄_i}} Σ_{t ∈ {t_k, t̄_k}} P(t, c) · log [ P(t, c) / (P(t) · P(c)) ]
Mutual information       MI(t_k, c_i)     log [ P(t_k, c_i) / (P(t_k) · P(c_i)) ]
Chi-square               χ²(t_k, c_i)     |Tr| · [ P(t_k, c_i) · P(t̄_k, c̄_i) − P(t_k, c̄_i) · P(t̄_k, c_i) ]² / [ P(t_k) · P(t̄_k) · P(c_i) · P(c̄_i) ]
NGL coefficient          NGL(t_k, c_i)    sqrt(|Tr|) · [ P(t_k, c_i) · P(t̄_k, c̄_i) − P(t_k, c̄_i) · P(t̄_k, c_i) ] / sqrt( P(t_k) · P(t̄_k) · P(c_i) · P(c̄_i) )
Relevancy score          RS(t_k, c_i)     log [ (P(t_k | c_i) + d) / (P(t̄_k | c̄_i) + d) ]
Odds ratio               OR(t_k, c_i)     [ P(t_k | c_i) · (1 − P(t_k | c̄_i)) ] / [ (1 − P(t_k | c_i)) · P(t_k | c̄_i) ]
GSS coefficient          GSS(t_k, c_i)    P(t_k, c_i) · P(t̄_k, c̄_i) − P(t_k, c̄_i) · P(t̄_k, c_i)

While each TSR function has its own rationale, the ultimate word on its value is the effectiveness it brings about. Various experimental comparisons of TSR functions have thus been carried out [Caropreso et al. 2001; Galavotti et al. 2000; Mladenić 1998; Yang and Pedersen 1997]. In these experiments most functions listed in Table I (with the possible exception of MI) have improved on the results of document frequency. For instance, Yang and Pedersen [1997] have shown that, with various classifiers and various initial corpora, sophisticated techniques such as IG_sum(t_k, c_i) or χ²_max(t_k, c_i) can reduce the dimensionality of the term space by a factor of 100 with no loss (or even with a small increase) of effectiveness. Collectively, the experiments reported in the above-mentioned papers seem to indicate that

    {OR_sum, NGL_sum, GSS_max} > {χ²_max, IG_sum} > {χ²_wavg} > {MI_max, MI_wsum},

where ">" means "performs better than." However, it should be noted that these results are just indicative, and that more general statements on the relative merits of these functions could be made only as a result of comparative experiments performed in thoroughly controlled conditions and on a variety of different situations (e.g., different classifiers, different initial corpora).

5.5. Dimensionality Reduction by Term Extraction

Given a predetermined |T'| << |T|, term extraction attempts to generate, from the original set T, a set T' of "synthetic" terms that maximize effectiveness. The rationale for using synthetic (rather than naturally occurring) terms is that, due to the pervasive problems of polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation. Methods for term extraction try to solve these problems by creating artificial terms that do not suffer from them. Any term extraction method consists in (i) a method for extracting the new terms from the old ones, and (ii) a method for converting the original document representations into new representations based on the newly synthesized dimensions. Two term extraction methods have been experimented with in TC, namely term clustering and latent semantic indexing.

5.5.1. Term Clustering. Term clustering tries to group words with a high degree of pairwise semantic relatedness, so that the groups (or their centroids, or a representative of them) may be used instead of the terms as dimensions of the vector space. Term clustering is different from term selection, since the former tends to address terms synonymous (or near-synonymous) with other terms, while the latter targets noninformative terms.(10)

Lewis [1992a] was the first to investigate the use of term clustering in TC. The method he employed, called reciprocal nearest neighbor clustering, consists in creating clusters of two terms that are one the most similar to the other according to some measure of similarity. His results were inferior to those obtained by single-word indexing, possibly due to a disappointing performance by the clustering method: as Lewis [1992a, page 48] said, "The relationships captured in the clusters are mostly accidental, rather than the systematic relationships that were hoped for."

Li and Jain [1998] viewed semantic relatedness between words in terms of their co-occurrence and co-absence within training documents. By using this technique in the context of a hierarchical clustering algorithm, they witnessed only a marginal effectiveness improvement; however, the small size of their experiment (see Section 6.11) hardly allows any definitive conclusion to be reached.

Both Lewis [1992a] and Li and Jain [1998] are examples of unsupervised clustering, since clustering is not affected by the category labels attached to the documents. Baker and McCallum [1998] provided instead an example of supervised clustering, as the distributional clustering method they employed clusters together those terms that tend to indicate the presence of the same category, or group of categories. Their experiments, carried out in the context of a Naive Bayes classifier (see Section 6.2), showed only a 2% effectiveness loss with an aggressivity of 1,000, and even showed some effectiveness improvement with less aggressive levels of reduction. Later experiments by Slonim and Tishby [2001] have confirmed the potential of supervised clustering methods for term extraction.

(10) Some term selection methods, such as wrapper methods, also address the problem of redundancy.

5.5.2. Latent Semantic Indexing. Latent semantic indexing (LSI [Deerwester et al. 1990]) is a DR technique developed in IR in order to address the problems deriving from the use of synonymous, near-synonymous, and polysemous words as dimensions of document representations. This technique compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence. In practice, LSI infers the dependence among the original terms from a corpus and "wires" this dependence into the newly obtained, independent dimensions. The function mapping original vectors into new vectors is obtained by applying a singular value decomposition to the matrix formed by the original document vectors. In TC this technique is applied by deriving the mapping function from the training set and then applying it to training and test documents alike.
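A minimal numpy sketch of this scheme follows; the rank k and the random matrix stand in for real data, and the plain projection used here omits the rescaling by inverse singular values that some LSI variants apply.

    import numpy as np

    def lsi_mapping(D, k):
        # D: |T| x |Tr| term-by-document matrix built from the training set.
        # SVD factors D into U S V^T; the first k left singular vectors
        # define the mapping from term space to the k LSI dimensions.
        U, S, Vt = np.linalg.svd(D, full_matrices=False)
        return U[:, :k]                       # |T| x k projection matrix

    rng = np.random.default_rng(0)
    D = rng.random((1000, 200))               # toy training matrix
    P = lsi_mapping(D, k=50)
    d_new = rng.random(1000)                  # a training or test document vector
    d_lsi = P.T @ d_new                       # its k-dimensional LSI representation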

One characteristic of LSI is that the newly obtained dimensions are not, unlike in term selection and term clustering, intuitively interpretable. However, they work well in bringing out the "latent" semantic structure of the vocabulary used in the corpus. For instance, Schütze et al. [1995, page 235] discussed the classification under category Demographic shifts in the U.S. with economic impact of a document that was indeed a positive test instance for the category, and that contained, among others, the quite revealing sentence The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West. The classifier decision was incorrect when local DR had been performed by χ²-based term selection retaining the top original 200 terms, but was correct when the same task was tackled by means of LSI. This well exemplifies how LSI works: the above sentence does not contain any of the 200 terms most relevant to the category selected by χ², but quite possibly the words contained in it have concurred to produce one or more of the LSI higher-order terms that generate the document space of the category. As Schütze et al. [1995, page 230] put it, "if there is a great number of terms which all contribute a small amount of critical information, then the combination of evidence is a major problem for a term-based classifier." A drawback of LSI, though, is that if some original term is particularly good in itself at discriminating a category, that discrimination power may be lost in the new vector space.

Wiener et al. [1995] used LSI in two alternative ways: (i) for local DR, thus creating several category-specific LSI representations, and (ii) for global DR, thus creating a single LSI representation for the entire category set. Their experiments showed the former approach to perform better than the latter, and both approaches to perform better than simple TSR based on relevancy score (see Table I).

Schütze et al. [1995] experimentally compared LSI-based term extraction with χ²-based TSR using three different classifier learning techniques (namely, linear discriminant analysis, logistic regression, and neural networks). Their experiments showed LSI to be far more effective than χ² for the first two techniques, while both methods performed equally well for the neural network classifier. For other TC works that have used LSI or similar term extraction techniques, see Hull [1994], Li and Jain [1998], Schütze [1998], Weigend et al. [1999], and Yang [1995].

6. INDUCTIVE CONSTRUCTION OF TEXT CLASSIFIERS

The inductive construction of text classifiers has been tackled in a variety of ways. Here we will deal only with the methods that have been most popular in TC, but we will also briefly mention the existence of alternative, less standard approaches.

We start by discussing the general form that a text classifier has. Let us recall from Section 2.4 that there are two alternative ways of viewing classification: "hard" (fully automated) classification and ranking (semiautomated) classification.

The inductive construction of a ranking classifier for category c_i ∈ C usually consists in the definition of a function CSV_i : D → [0, 1] that, given a document d_j, returns a categorization status value for it, that is, a number between 0 and 1 that, roughly speaking, represents the evidence for the fact that d_j ∈ c_i. Documents are then ranked according to their CSV_i value. This works for "document-ranking TC"; "category-ranking TC" is usually tackled by ranking, for a given document d_j, its CSV_i scores for the different categories in C = {c_1, ..., c_|C|}. The CSV_i function takes up different meanings according to the learning method used: for instance, in the "Naive Bayes" approach of Section 6.2, CSV_i(d_j) is defined in terms of a probability, whereas in the "Rocchio" approach discussed in Section 6.7, CSV_i(d_j) is a measure of vector closeness in |T|-dimensional space.

The construction of a "hard" classifier may follow two alternative paths. The former consists in the definition of a function CSV_i : D → {T, F}. The latter consists instead in the definition of a function CSV_i : D → [0, 1], analogous to the one used for ranking classification, followed by the definition of a threshold τ_i such that CSV_i(d_j) ≥ τ_i is interpreted as T while CSV_i(d_j) < τ_i is interpreted as F.(11)

(11) Alternative methods are possible, such as training a classifier for which some standard, predefined value such as 0 is the threshold. For ease of exposition we will not discuss them.

The definition of thresholds will be the topic of Section 6.1. In Sections 6.2 to 6.12 we will instead concentrate on the definition of CSV_i, discussing a number of approaches that have been applied in the TC literature. In general we will assume we are dealing with "hard" classification; it will be evident from the context how and whether the approaches can be adapted to ranking classification. The presentation of the algorithms will be mostly qualitative rather than quantitative; that is, it will focus on the methods for classifier learning rather than on the effectiveness and efficiency of the classifiers built by means of them, which will instead be the focus of Section 7.

6.1. Determining Thresholds

There are various policies for determining the threshold τ_i, also depending on the constraints imposed by the application. The most important distinction is whether the threshold is derived analytically or experimentally.

The former method is possible only in the presence of a theoretical result that indicates how to compute the threshold that maximizes the expected value of the effectiveness function [Lewis 1995a]. This is typical of classifiers that output probability estimates of the membership of d_j in c_i (see Section 6.2) and whose effectiveness is computed by decision-theoretic measures such as utility (see Section 7.1.3); we thus defer the discussion of this policy (which is called probability thresholding in Lewis [1995a]) to Section 7.1.3.

When such a theoretical result is not known, one has to revert to the latter method, which consists in testing different values for τ_i on a validation set and choosing the value which maximizes effectiveness. We call this policy CSV thresholding [Cohen and Singer 1999; Schapire et al. 1998; Wiener et al. 1995]; it is also called Scut in Yang [1999]. Different τ_i's are typically chosen for the different c_i's.

A second, popular experimental policy is proportional thresholding [Iwayama and Tokunaga 1995; Larkey 1998; Lewis 1992a; Lewis and Ringuette 1994; Wiener et al. 1995], also called Pcut in Yang [1999]. This policy consists in choosing the value of τ_i for which g_Va(c_i) is closest to g_Tr(c_i), the training set generality of c_i, and embodies the principle that the same percentage of documents of both training and test set should be classified under c_i. For obvious reasons, this policy does not lend itself to document-pivoted TC.

Sometimes, depending on the application, a fixed thresholding policy (a.k.a. "k-per-doc" thresholding [Lewis 1992a] or Rcut [Yang 1999]) is applied, whereby it is stipulated that a fixed number k of categories, equal for all d_j's, are to be assigned to each document d_j. This is often used, for instance, in applications of TC to automated document indexing [Field 1975; Lam et al. 1999]. Strictly speaking, however, this is not a thresholding policy in the sense defined at the beginning of Section 6.1, as it might happen that d' is classified under c_i while d'' is not, even though CSV_i(d') < CSV_i(d''). Quite clearly, this policy is mostly at home with document-pivoted TC. However, it suffers from a certain coarseness, as the fact that k is equal for all documents (nor could this be otherwise) allows no fine-tuning.

In his experiments, Lewis [1992a] found the proportional policy to be superior to probability thresholding when microaveraged effectiveness was tested, but slightly inferior with macroaveraging (see Section 7.1.1). Yang [1999] found instead CSV thresholding to be superior to proportional thresholding (possibly due to her category-specific optimization on a validation set), and found fixed thresholding to be consistently inferior to the other two policies. The fact that these latter results have been obtained across different classifiers no doubt reinforces them.

In general, aside from the considerations above, the choice of the thresholding policy may also be influenced by the application; for instance, in applying a text classifier to document indexing for Boolean systems, a fixed thresholding policy might be chosen, while a proportional or CSV thresholding method might be chosen for Web page classification under hierarchical catalogues.
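A minimal sketch of experimental CSV thresholding (Scut) on a validation set, here optimizing the F1 function of Section 7; the candidate grid of thresholds is an assumption of the example.

    def f1(decisions, labels):
        tp = sum(1 for d, l in zip(decisions, labels) if d and l)
        fp = sum(1 for d, l in zip(decisions, labels) if d and not l)
        fn = sum(1 for d, l in zip(decisions, labels) if not d and l)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    def scut_threshold(csv_scores, labels, grid=None):
        # Try each candidate tau_i on the validation set and keep the one
        # maximizing effectiveness (here, F1) for category c_i.
        grid = grid or [i / 100 for i in range(101)]
        return max(grid, key=lambda tau: f1([s >= tau for s in csv_scores], labels))

As noted above, this is typically run once per category, so that each c_i gets its own τ_i.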

6.2. Probabilistic Classifiers

Probabilistic classifiers (see Lewis [1998] for a thorough discussion) view CSV_i(d_j) in terms of P(c_i | d_j), that is, the probability that a document represented by a vector d_j = <w_1j, ..., w_|T|j> of (binary or weighted) terms belongs to c_i, and compute this probability by an application of Bayes' theorem, given by

    P(c_i | d_j) = P(c_i) P(d_j | c_i) / P(d_j).    (3)

In (3) the event space is the space of documents: P(d_j) is thus the probability that a randomly picked document has vector d_j as its representation, and P(c_i) the probability that a randomly picked document belongs to c_i.

The estimation of P(d_j | c_i) in (3) is problematic, since the number of possible vectors d_j is too high (the same holds for P(d_j), but for reasons that will be clear shortly this will not concern us). In order to alleviate this problem, it is common to make the assumption that any two coordinates of the document vector are, when viewed as random variables, statistically independent of each other; this independence assumption is encoded by the equation

    P(d_j | c_i) = Π_{k=1..|T|} P(w_kj | c_i).    (4)

Probabilistic classifiers that use this assumption are called Naive Bayes classifiers, and account for most of the probabilistic approaches to TC in the literature (see Joachims [1998]; Koller and Sahami [1997]; Larkey and Croft [1996]; Lewis [1992a]; Lewis and Gale [1994]; Li and Jain [1998]; Robertson and Harding [1984]). The "naive" character of the classifier is due to the fact that usually this assumption is, quite obviously, not verified in practice.

One of the best-known Naive Bayes approaches is the binary independence classifier [Robertson and Sparck Jones 1976], which results from using binary-valued vector representations for documents. In this case, if we write p_ki as short for P(w_kx = 1 | c_i), the P(w_kj | c_i) factors of (4) may be written as

    P(w_kj | c_i) = p_ki^w_kj (1 − p_ki)^(1 − w_kj) = (p_ki / (1 − p_ki))^w_kj (1 − p_ki).    (5)

We may further observe that in TC the document space is partitioned into two categories,(12) c_i and its complement c̄_i, such that P(c̄_i | d_j) = 1 − P(c_i | d_j). If we plug (4) and (5) into (3) and take logs, we obtain

    log P(c_i | d_j) = log P(c_i) + Σ_{k=1..|T|} w_kj log [ p_ki / (1 − p_ki) ] + Σ_{k=1..|T|} log (1 − p_ki) − log P(d_j)    (6)

    log (1 − P(c_i | d_j)) = log (1 − P(c_i)) + Σ_{k=1..|T|} w_kj log [ p̄_ki / (1 − p̄_ki) ] + Σ_{k=1..|T|} log (1 − p̄_ki) − log P(d_j),    (7)

where we write p̄_ki as short for P(w_kx = 1 | c̄_i). We may convert (6) and (7) into a single equation by subtracting componentwise (7) from (6), thus obtaining

    log [ P(c_i | d_j) / (1 − P(c_i | d_j)) ] = log [ P(c_i) / (1 − P(c_i)) ] + Σ_{k=1..|T|} w_kj log [ p_ki (1 − p̄_ki) / (p̄_ki (1 − p_ki)) ] + Σ_{k=1..|T|} log [ (1 − p_ki) / (1 − p̄_ki) ].    (8)

Note that P(c_i | d_j) / (1 − P(c_i | d_j)) is an increasing monotonic function of P(c_i | d_j), and may thus be used directly as CSV_i(d_j). Note also that log [P(c_i)/(1 − P(c_i))] and Σ_{k=1..|T|} log [(1 − p_ki)/(1 − p̄_ki)] are constant for all documents, and may thus be disregarded.(13) Defining a classifier for category c_i thus basically requires estimating the 2|T| parameters {p_1i, ..., p_|T|i, p̄_1i, ..., p̄_|T|i} from the training data, which may be done in the obvious way. Note that in general the classification of a given document does not require one to compute a sum of |T| factors, as the presence of Σ_{k=1..|T|} w_kj log [p_ki (1 − p̄_ki)/(p̄_ki (1 − p_ki))] would imply; in fact, all the factors for which w_kj = 0 may be disregarded, and this accounts for the vast majority of them, since document vectors are usually very sparse.

(12) Cooper [1995] has pointed out that in this case the full independence assumption of (4) is not actually made in the Naive Bayes classifier; the assumption needed here is instead the weaker linked dependence assumption, which may be written as

    P(d_j | c_i) / P(d_j | c̄_i) = Π_{k=1..|T|} P(w_kj | c_i) / P(w_kj | c̄_i).

(13) This is not true, however, if the "fixed thresholding" method of Section 6.1 is adopted. In fact, for a fixed document d_j the first and third factors in the formula above are different for different categories, and may therefore influence the choice of the categories under which to file d_j.
—to relax the independence assumption. very high or very low) for long documents (i.. but is problematic in probabilistic ones (see Lewis [1998]. Recently. if the “fixed thresholding” method of Section 6. among these are the ones aiming —to relax the constraint that document vectors should be binary-valued. A recent paper by Lewis [1998] is an excellent roadmap on the various directions that research on Na¨ve ı Bayes classifiers has taken. Guthrie et al. the fact that the binary independence assumption seldom harms effectiveness has also been given some theoretical justification [Domingos and Pazzani 1997].e. This This is not true. Note that in general the classification of a given document does not require one to compute a sum of |T | factors.13 Defining a classifier for category ci thus basically requires estimating the 2|T | parameters { p1i . Salton and Buckley [1988]) accounting for the “importance” of tk for d j play a key role in IR. Vol. [1998] for a ACM Computing Surveys. p|T |i } from the training ¯ ¯ data. 1998a. and this accounts for the vast majority of them. p1i . an assumption even more implausible than (4). as noted by Lewis [1998]. vanRijsbergen [1977]) have not shown the performance improvements that were hoped for. 1998. since this produces classifiers of higher computational cost and characterized by harder parameter estimation problems [Koller and Sahami 1997]. Unlike other types of classifiers. since document vectors are usually very sparse. documents such that wk j = 1 for many values of k). 1994]. has the drawback that different occurrences of the same word within the same document are viewed as independent.e. all the factors for which wk j = 0 may be disregarded. and may thus be used directly as CSVi (d j ). McCallum et al. however. which may be done in the obvious way. 13 P (ci | d j ) 1−P (ci | d j ) is an increasing mono- looks natural. thus obtaining log P (ci | d j ) 1 − P (ci | d j ) P (ci ) + 1 − P (ci ) |T | |T | 21 = log + wk j log k=1 pki (1 − pki ) ¯ pki (1 − pki ) ¯ (8) log k=1 1 − pki . Earlier efforts in this direction within probabilistic text search (e.g. This may be the hardest route to follow. In fact. Note also |T | 1− pki P (ci ) that log 1−P (ci ) and k=1 log 1− pki are ¯ constant for all documents. This accounts for document length naturally but... We may convert (6) and (7) into a single equation by subtracting componentwise (7) from (6).1 is adopted. Chakrabarti et al. for a fixed document d j the first and third factor in the formula above are different for different categories. |T | pki (1− pki ) ¯ as the presence of k=1 wk j log pki (1− pki ) ¯ would imply. . . given that weighted indexing techniques (see Fuhr [1989]. 34. . Section 5). . The method we have illustrated is just ı one of the many variants of the Na¨ve Bayes approach. The value of log 1−P (c | d ) tends i j to be more extreme (i. the literature on probabilistic classifiers is inextricably intertwined with that on probabilistic search systems (see Crestani et al. One possible answer is to switch from an interpretation of Na¨ve ı Bayes in which documents are events to one in which terms are events [Baker and McCallum 1998. p|T |i . 1. in fact.Machine Learning in Automated Text Categorization where we write pki as short for ¯ ¯ P (wkx = 1 | ci ). thus calling for length normalization. The quotation of text search in the last paragraph is not casual. and may thus be disregarded. —to introduce document length normalP (ci | d j ) ization. No. 
The method we have illustrated is just one of the many variants of the Naive Bayes approach, the common denominator of which is (4). A recent paper by Lewis [1998] is an excellent roadmap of the various directions that research on Naive Bayes classifiers has taken; among these are the ones aiming

—to relax the constraint that document vectors should be binary-valued. This looks natural, given that weighted indexing techniques (see Fuhr [1989]; Salton and Buckley [1988]) accounting for the "importance" of t_k for d_j play a key role in IR. One possible answer is to switch from an interpretation of Naive Bayes in which documents are events to one in which terms are events [Baker and McCallum 1998; Chakrabarti et al. 1998a; Guthrie et al. 1994; McCallum et al. 1998]. This accounts for document length naturally but, as noted by Lewis [1998], has the drawback that different occurrences of the same word within the same document are viewed as independent, an assumption even more implausible than (4);

—to introduce document length normalization. The value of log [P(c_i | d_j)/(1 − P(c_i | d_j))] tends to be more extreme (i.e., very high or very low) for long documents (i.e., documents such that w_kj = 1 for many values of k), irrespectively of their semantic relatedness to c_i, thus calling for length normalization. Taking length into account is easy in nonprobabilistic approaches to classification (see Section 6.7), but is problematic in probabilistic ones (see Lewis [1998], Section 5);

—to relax the independence assumption. This may be the hardest route to follow, since this produces classifiers of higher computational cost and characterized by harder parameter estimation problems [Koller and Sahami 1997]. Earlier efforts in this direction within probabilistic text search (e.g., van Rijsbergen [1977]) have not shown the performance improvements that were hoped for. Recently, the fact that the binary independence assumption seldom harms effectiveness has also been given some theoretical justification [Domingos and Pazzani 1997].

The quotation of text search in the last paragraph is not casual. Unlike other types of classifiers, the literature on probabilistic classifiers is inextricably intertwined with that on probabilistic search systems (see Crestani et al. [1998] for a review), since these latter attempt to determine the probability that a document falls in the category denoted by the query, and since they are the only search systems that take relevance feedback, a notion essentially involving supervised learning, as central.

6.3. Decision Tree Classifiers

Probabilistic methods are quantitative (i.e., numeric) in nature, and as such have sometimes been criticized since, effective as they may be, they are not easily interpretable by humans. A class of algorithms that do not suffer from this problem are symbolic (i.e., nonnumeric) algorithms, among which inductive rule learners (which we will discuss in Section 6.4) and decision tree learners are the most important examples.

A decision tree (DT) text classifier (see Mitchell [1996], Chapter 3) is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leafs are labeled by categories. Such a classifier categorizes a test document d_j by recursively testing for the weights that the terms labeling the internal nodes have in vector d_j, until a leaf node is reached; the label of this node is then assigned to d_j. Most such classifiers use binary document representations, and thus consist of binary trees. An example DT is illustrated in Figure 2.

[Fig. 2. A decision tree equivalent to the DNF rule of Figure 1. Edges are labeled by terms and leaves are labeled by categories (underlining denotes negation).]

There are a number of standard packages for DT learning, and most DT approaches to TC have made use of one such package. Among the most popular ones are ID3 (used by Fuhr et al. [1991]), C4.5 (used by Cohen and Hirsh [1998]; Cohen and Singer [1999]; Joachims [1998]; Lewis and Catlett [1994]), and C5 (used by Li and Jain [1998]). TC efforts based on experimental DT packages include Dumais et al. [1998], Lewis and Ringuette [1994], and Weiss et al. [1999].

A possible method for learning a DT for category c_i consists in a "divide and conquer" strategy of (i) checking whether all the training examples have the same label (either c_i or c̄_i); (ii) if not, selecting a term t_k, partitioning Tr into classes of documents that have the same value for t_k, and placing each such class in a separate subtree. The process is recursively repeated on the subtrees until each leaf of the tree so generated contains training examples assigned to the same category c_i, which is then chosen as the label for the leaf. The key step is the choice of the term t_k on which to operate the partition, a choice which is generally made according to an information gain or entropy criterion. However, such a "fully grown" tree may be prone to overfitting, as some branches may be too specific to the training data. Most DT learning methods thus include a method for growing the tree and one for pruning it, that is, for removing the overly specific branches. Variations on this basic schema for DT learning abound [Mitchell 1996, Section 3].

DT text classifiers have been used either as the main classification tool [Fuhr et al. 1991; Lewis and Catlett 1994; Lewis and Ringuette 1994], or as baseline classifiers [Cohen and Singer 1999; Joachims 1998], or as members of classifier committees [Li and Jain 1998; Schapire and Singer 2000; Weiss et al. 1999].
. [1991]).3. and as such have sometimes been criticized since. 34. and most DT approaches to TC have made use of one such package. (ii) if not. nonnumeric) algorithms. [1999]. . since these latter attempt to determine the probability that a document falls in the category denoted by the query. C4. effective as they may be. and leafs are labeled by categories. 6. A decision tree (DT) text classifier (see Mitchell [1996]. numeric) in nature. Most such classifiers use binary document representations. a notion essentially involving supervised learning.e. selecting a term tk . March 2002.5 (used by Cohen and Hirsh [1998]. and C5 (used by Li and Jain [1998]). There are a number of standard packages for DT learning. Joachims [1998]. [1998]. and since they are the only search systems that take relevance feedback. A possible method for learning a DT for category ci consists in a “divide and conquer” strategy of (i) checking whether all the training examples have the same ¯ label (either ci or ci ).4) and decision tree learners are the most important examples. A decision tree equivalent to the DNF rule of Figure 1. Lewis and Ringuette [1994]. and Lewis and Catlett [1994]). A class of algorithms that do not suffer from this problem are symbolic (i. branches departing from them are labeled by tests on the weight that the term has in the test document. they are not easily interpretable by humans. as central. Edges are labeled by terms and leaves are labeled by categories (underlining denotes negation). partitioning Tr into classes of documents that have the same value for tk . Vol. and thus consist of binary trees.. and Weiss et al. until a leaf node is reached.e. Cohen and Singer [1999]. and placing each such class in a separate subtree. 1. Such a classifier categorizes a test document d j by recursively testing for the weights that the terms labeling the internal nodes have in vector d j . among which inductive rule learners (which we will discuss in Section 6. No. 2.22 Sebastiani Fig. review). Decision Tree Classifiers Probabilistic methods are quantitative (i. The process is recursively repeated on the subtrees until each ACM Computing Surveys. Among the most popular ones are ID3 (used by Fuhr et al. An example DT is illustrated in Figure 2. TC efforts based on experimental DT packages include Dumais et al. Chapter 3) is a tree in which internal nodes are labeled by terms. the label of this node is then assigned to d j .

but obviously scores high in terms of overfitting. March 2002.. or as baseline classifiers [Cohen and Singer 1999.e. Lewis and Catlett 1994.g.. This set of clauses is already a DNF classifier for ci . DNF rule learners vary widely in terms of the methods. Lewis and Ringuette 1994]. as some branches may be too specific to the training data. RIPPER [Cohen 1995a. removing premises from clauses. where η1 . . for removing the overly specific branches.e. [1995]). No. Initially. ηn → γi . Vol. DNF rules are similar to DTs in that they can encode any Boolean function. Cohen and Hirsh 1998. Cohen [1995a] has extensively compared PL and FOL learning in TC (for instance. arbitrarily nested if-then-else clauses) instead of DNF rules. 1996]. . “divide-and-conquer” strategy. Joachims 1998]. of the type illustrated in Figure 1. rules that correctly classify all the training examples) the “best” one 14 according to some minimality criterion. Regression denotes ACM Computing Surveys.14 The literals (i. DL-ESC [Li and Yamanishi 1999]. . we will disregard the issue. obtainable through the use of inductive logic programming methods. The learner applies then a process of generalization in which the rule is simplified through a series of modifications (e. Lewis and Gale [1994]. SCAR [Moulinier et al. However. Variations on this basic schema for DT learning abound [Mitchell 1996. Among the DNF rule learners that have been applied to TC are CHARADE [Moulinier and Ganascia 1996].Machine Learning in Automated Text Categorization leaf of the tree so generated contains training examples assigned to the same category ci . While the methods above use rules of propositional logic (PL). Rule learning methods usually attempt to select from all the possible covering rules (i. ¨ Schutze et al. Cohen and Singer 1999]. that is. where the ability to correctly classify all the training examples is traded for more generality. . . or as members of classifier committees [Li and Jain 1998. comparing the PL learner RIPPER with its FOL version FLIPPER). . a “pruning” phase similar in spirit to that employed in DTs is applied. e and SWAP-1 [Apt´ 1994]. 6. 34. and has found that the additional representational power of FOL brings about only modest benefits. Ittner et al. The key step is the choice of the term tk on which to operate the partition.4. ηn are the terms contained in d j and γi equals ci or ci according to whether ¯ d j is a positive or negative example of ci . which is then chosen as the label for the leaf. . heuristics and criteria employed for generalization and pruning.e. while the clause head denotes the decision to classify d j under ci . 1999]. DT text classifiers have been used either as the main classification tool [Fuhr et al. possibly negated keywords) in the premise denote the presence (nonnegated keyword) or absence (negated keyword) of the keyword in the test document d j . Schapire and Singer 2000. such a “fully grown” tree may be prone to overfitting. an advantage of DNF rule learners is that they tend to generate more compact classifiers than DT learners. 1. Regression Methods Many inductive rule learning algorithms build decision lists (i. or merging clauses) that maximize its compactness while at the same time not affecting the “covering” property of the classifier. a choice which is generally made according to an information gain or entropy criterion. that is. [1995]. 6. every training example d j is viewed as a clause η1 .5. 
DNF rule learners vary widely in terms of the methods, heuristics, and criteria employed for generalization and pruning. Among the DNF rule learners that have been applied to TC are CHARADE [Moulinier and Ganascia 1996], DL-ESC [Li and Yamanishi 1999], RIPPER [Cohen 1995a; Cohen and Hirsh 1998; Cohen and Singer 1999], SCAR [Moulinier et al. 1996], and SWAP-1 [Apté 1994].

While the methods above use rules of propositional logic (PL), research has also been carried out using rules of first-order logic (FOL), obtainable through the use of inductive logic programming methods. Cohen [1995a] has extensively compared PL and FOL learning in TC (for instance, comparing the PL learner RIPPER with its FOL version FLIPPER), and has found that the additional representational power of FOL brings about only modest benefits.

6.5. Regression Methods

Various TC efforts have used regression models (see Fuhr and Pfeifer [1994]; Ittner et al. [1995]; Lewis and Gale [1994]; Schütze et al. [1995]). Regression denotes the approximation of a real-valued (instead of binary, as in the case of classification) target function Φ̆ by means of a function Φ̂ that fits the training data [Mitchell 1996, page 236]. Here we will describe one such model, the Linear Least-Squares Fit (LLSF) applied to TC by Yang and Chute [1994].

one example of a batch method is linear discriminant analysis. Vol. and O is the |C| × |Tr| matrix whose columns are the output vecˆ tors of the training documents.7. and such that CSVi (d j ) corresponds to the dot |T | product k=1 wki wk j of d j and ci . given its input vector I (d j ). and an output vector O(d j ) of |C| weights representing the categories (the weights for this latter vector are binary for training documents. 2 wki · A linear classifier for category ci is a vector ci = w1i . incremental) methods build a classifier soon after examining the first training document. Within the TC literature. is that the cost of comˆ puting the M matrix is much higher than that of many other competitors in the TC arena. the foremost example of a batch method is the Rocchio method. The experiments of Yang and Chute [1994] and Yang and Liu [1999] indicate that LLSF is one of the most effective text classifiers known to date. In LLSF. In this section we will instead concentrate on on-line methods. . 1995]. 34. 6.24 the approximation of a real-valued (instead than binary. because of its importance in the TC literature. ¨ Schutze et al. which means in turn that once a linear classifier has been built. I is the |T | × |Tr| matrix whose columns are the input vectors of the training documents. this will be discussed separately in Section 6.6. Methods for learning linear classifiers are often partitioned in two broad classes. batch methods and on-line methods. and can therefore be adapted to doing TC with a linear classifier. Batch methods build a classifier by analyzing the training set all at once. classification can be performed by invoking such an engine. each document d j has two vectors associated to it: an input vector I (d j ) of |T | weighted terms. and incrementally refine it as they examine new ones. Practically all search engines have a dot product flavor to them. w|T |i belonging to the which represents the cosine of the angle α that separates the two vectors. 1. This may be an advantage in the applications in which Tr is not available in its entirety from the start. Classification may thus be seen as the task of determining an output vector O(d j ) for test document d j . . V F = i=1 j =1 vi j represents the so-called Frobenius norm of a |C| × |T | matrix. . d j ) = cos(α) = |T | k=1 |T | k=1 wki · wk j |T | k=1 2 wk j . The M matrix is usually computed by performing a singular value decomposition on the training set. LLSF computes the matrix from the training data by computing a linear leastsquares fit that minimizes the error on the ˆ training set according to the formula M = arg min M MI − O F . the dot product between the two vectors corresponds to their cosine similarity. that is: S(ci . and its generic entry mik repreˆ sents the degree of association between category ci and term tk . as for example. in adaptive ACM Computing Surveys. March 2002. or in which the “meaning” of the category may change in time.a. though. One of its disadvantages. the Linear Least-Squares Fit (LLSF) applied to TC by Yang and Chute [1994]. . Here we will describe one such model. a model of the stochastic dependence between terms that relies on the covariance matrices of the categories [Hull 1994. This is the similarity measure between query and document computed by standard vectorspace IR engines. hence. building a classifier boils down to computing ˆ ˆ a |C| × |T | matrix M such that MI(d j ) = O(d j ). 
as in the case of classification) function ˘ by means of a function that fits the training data [Mitchell 1996. Note that when both classifier and document weights are cosine-normalized (see (2)). No. page 236]. and are nonbinary CSV s for test documents). . However. where arg min M (x) stands as usual for the M for which x is def |C| |T | 2 minimum.k. On-line (a. On-Line Methods Sebastiani same |T |-dimensional space in which documents are also represented.
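A minimal numpy sketch of the LLSF fit follows, using the Moore-Penrose pseudoinverse (one standard way of solving the least-squares problem; the dimensions are toy values):

    import numpy as np

    def llsf_fit(I, O):
        # M̂ = arg min_M ||MI − O||_F, computed via the pseudoinverse
        # of the |T| x |Tr| input matrix I.
        return O @ np.linalg.pinv(I)

    rng = np.random.default_rng(0)
    I = rng.random((500, 100))         # |T| x |Tr| input (term) vectors
    O = rng.integers(0, 2, (10, 100))  # |C| x |Tr| binary output vectors
    M = llsf_fit(I, O)                 # |C| x |T| association matrix
    csv_scores = M @ I[:, 0]           # output vector for one document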

A simple on-line method is the perceptron algorithm, first applied to TC by Schütze et al. [1995] and Wiener et al. [1995], and subsequently used by Dagan et al. [1997] and Ng et al. [1997]. In this algorithm, the classifier for c_i is first initialized by setting all weights w_ki to the same positive value. When a training example d_j (represented by a vector d_j of binary weights) is examined, the classifier built so far classifies it. If the result of the classification is correct, nothing is done, while if it is wrong, the weights of the classifier are modified: if d_j was a positive example of c_i, then the weights w_ki of "active terms" (i.e., the terms t_k such that w_kj = 1) are "promoted" by increasing them by a fixed quantity α > 0 (called the learning rate), while if d_j was a negative example of c_i, then the same weights are "demoted" by decreasing them by α. Note that, when the classifier has reached a reasonable level of effectiveness, the fact that a weight w_ki is very low means that t_k has negatively contributed to the classification process so far, and may thus be discarded from the representation. We may then see the perceptron algorithm (as all other incremental learning methods) as allowing for a sort of "on-the-fly term space reduction" [Dagan et al. 1997, Section 4.4]. The perceptron is an additive weight-updating algorithm, and it has shown good effectiveness in all the experiments quoted above.

A multiplicative variant of it is POSITIVE WINNOW [Dagan et al. 1997], which differs from the perceptron because two different constants α_1 > 1 and 0 < α_2 < 1 are used for promoting and demoting weights, and because promotion and demotion are achieved by multiplying, instead of adding, by α_1 and α_2, respectively. BALANCED WINNOW [Dagan et al. 1997] is a further variant of POSITIVE WINNOW, in which the classifier consists of two weights w⁺_ki and w⁻_ki for each term t_k; the final weight w_ki used in computing the dot product is the difference w⁺_ki − w⁻_ki. Following the misclassification of a positive instance, active terms have their w⁺_ki weight promoted and their w⁻_ki weight demoted, whereas in the case of a negative instance it is w⁺_ki that gets demoted while w⁻_ki gets promoted (for the rest, promotions and demotions are as in POSITIVE WINNOW). BALANCED WINNOW allows negative w_ki weights, while in the perceptron and in POSITIVE WINNOW the w_ki weights are always positive. In experiments conducted by Dagan et al. [1997], POSITIVE WINNOW showed a better effectiveness than the perceptron, but was in turn outperformed by (Dagan et al.'s own version of) BALANCED WINNOW.

Other on-line methods for building text classifiers are WIDROW-HOFF, a refinement of it called EXPONENTIATED GRADIENT (both applied for the first time to TC in [Lewis et al. 1996]), and SLEEPING EXPERTS [Cohen and Singer 1999], a version of BALANCED WINNOW. While the first is an additive weight-updating algorithm, the second and third are multiplicative. Key differences with the previously described algorithms are that these three algorithms (i) update the classifier not only after misclassifying a training example, but also after classifying it correctly, and (ii) update the weights corresponding to all terms (instead of just active ones).
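A minimal sketch of the perceptron update just described; the threshold, learning rate, and epoch count are arbitrary choices for the example.

    def perceptron_train(examples, vocabulary, alpha=0.5, tau=1.0, epochs=5):
        # examples: (set-of-active-terms, is_positive) pairs for category c_i
        w = {t: 1.0 for t in vocabulary}        # uniform positive initialization
        for _ in range(epochs):
            for active_terms, is_positive in examples:
                score = sum(w[t] for t in active_terms if t in w)
                predicted = score >= tau
                if predicted == is_positive:
                    continue                    # correct: nothing is done
                delta = alpha if is_positive else -alpha
                for t in active_terms:          # promote/demote active terms only
                    if t in w:
                        w[t] += delta
        return w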
On-line methods are also apt to applications (e.g., semiautomated classification, adaptive filtering) in which we may expect the user of a classifier to provide feedback on how test documents have been classified, as in this case further training may be performed during the operating phase by exploiting the user feedback.

Linear classifiers lend themselves to both category-pivoted and document-pivoted TC. For the former, the classifier c_i is used, in a standard search engine, as a query against the set of test documents, while for the latter, the vector d_j representing the test document is used as a query against the set of classifiers {c_1, ..., c_|C|}.

This method is quite easy to implement.g. March 2002.1. instead. The role of negative examples is usually deempha- the Rocchio formula to profile extraction is whether the set NEGi should be considered in its entirety. or as a baseline classifier [Cohen and Singer 1999. . since nearpositives are the most difficult documents to tell apart from the positives. In this formula. [1995]. by setting β to a high value and γ to a low one (e.g. and is also quite efficient. Sable and Hatzivassiloglou 2000. 1997]. and its distance from the centroid of the negative training examples. Ittner et al. and it is perhaps the only TC method rooted in the IR tradition rather than in the ML one. the profile of ci is the centroid of its positive training examples. or as a member of a classifier committee [Larkey and Croft 1996] (see Section 6. More generally.11). One issue in the application of wkj − |POSi | wkj . 1. Cohen and Singer [1999]. Using near-positives corresponds to the query zoning method proposed for IR by Singhal et al. [1998]. 6. Joachims 1997.7. Joachims 1998. should be selected from it. ci ) = F }. In terms of effectiveness. 1996. since learning a classifier basically comes down to averaging weights. . Schapire and Singer ¨ 2000. This adaptation was first proposed by Hull [1994]. No. Schutze et al. when the original ACM Computing Surveys. For instance. Note that even most of the positive training examples would not be classified correctly by the classifier. and NEGi = {d j ∈ Tr | ˘ (d j . a drawback is that if the documents in the category tend to occur in disjoint clusters (e. Galavotti et al. a profile of ci is a weighted list of the terms whose presence or absence is most useful for discriminating ci . This method originates from the observation that. Lewis et al. Learning a linear classifier is often preceded by local TSR. as all linear classifiers. 1998. Singhal et al. 34. . w|T |i for category ci by means of the formula wki = β · {d j ∈POSi } Sebastiani sized.. This situation is graphically depicted in Figure 3(a). 1995. as the centroid of these documents may fall outside all of these clusters (see Figure 3(a)). |NPOSi | γ · {d j ∈NPOSi } w kj The {d j ∈NPOSi } |NPOSi | factor is more sigwkj nificant than {d j ∈NEGi } |NEGi | . where documents are classified within ci if and only if they fall within the circle. . in this case. [1995]). and has been used by many authors since then. Rocchio’s method computes a classifier ci = w1i . yielding wki = β · {d j ∈POSi } wkj − |POSi | wkj . β and γ are control parameters that allow setting the relative importance of positive and negative examples. [1997]. POSi = {d j ∈ Tr | ˘ (d j .26 classifier. . such a classifier may miss most of them. either as an object of research in its own right [Ittner et al. a classifier built by the Rocchio method. Vol. The Rocchio method is used for inducing linear. or whether a wellchosen sample of it. if β is set to 1 and γ to 0 (as in Dumais et al. and Joachims [1997] use β = 16 and γ = 4).. 2000. Hull [1994]. a set of newspaper articles lebeled with the Sports category and dealing with either boxing or rockclimbing). Schutze et al. ci ) = T }. Enhancements to the Basic Rocchio Framework. A classifier built by means of the Rocchio method rewards the closeness of a test document to the centroid of the positive training examples. Schapire et al. ¨ Joachims [1998]. It relies on an adaptation to TC of the wellknown Rocchio’s formula for relevance feedback in the vector-space model. profile-style classifiers. 
such as the set NPOS_i of near-positives (defined as "the most positive among the negative training examples"). Near-positives tend to be used rather than generic negatives, since they are the most difficult documents to tell apart from the positives. Selecting them yields

    w_ki = β · Σ_{d_j ∈ POS_i} w_kj / |POS_i| − γ · Σ_{d_j ∈ NPOS_i} w_kj / |NPOS_i|,

in which the factor Σ_{d_j ∈ NPOS_i} w_kj / |NPOS_i| is more significant than Σ_{d_j ∈ NEG_i} w_kj / |NEG_i|. Using near-positives corresponds to the query zoning method proposed for IR by Singhal et al. [1997]. This method originates from the observation that, when the original Rocchio formula is used for relevance feedback in IR, near-positives tend to be used rather than generic negatives, as the documents on which user judgments are available are among the ones that had scored highest in the previous ranking.
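A minimal sketch of Rocchio profile learning follows, assuming documents are represented as dictionaries mapping terms to weights; the default β and γ values are the ones quoted above, and the helper name is illustrative:

    from collections import defaultdict

    def rocchio_profile(pos_docs, neg_docs, beta=16.0, gamma=4.0):
        # w_ki = beta * (average weight of t_k over POS_i)
        #      - gamma * (average weight of t_k over NEG_i, or NPOS_i if
        #        neg_docs holds only the near-positives).
        profile = defaultdict(float)
        for doc in pos_docs:
            for term, w in doc.items():
                profile[term] += beta * w / len(pos_docs)
        for doc in neg_docs:
            for term, w in doc.items():
                profile[term] -= gamma * w / len(neg_docs)
        return dict(profile)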

Early applications of the Rocchio formula to TC (e.g., Hull [1994] and Ittner et al. [1995]) generally did not make a distinction between near-positives and generic negatives. In order to select the near-positives, Schapire et al. [1998] issue a query, consisting of the centroid of the positive training examples, against a document base consisting of the negative training examples; the top-ranked documents are the most similar to this centroid, and are then taken as the near-positives. Wiener et al. [1995] instead equate the near-positives of c_i to the positive examples of the sibling categories of c_i, as in the application they work on (TC with hierarchically organized category sets) the notion of a "sibling category of c_i" is well defined. A similar policy is also adopted by Ng et al. [1997], Ruiz and Srinivasan [1999], and Weigend et al. [1999].

By using query zoning plus other enhancements (TSR, statistical phrases, and a method called dynamic feedback optimization), Schapire et al. [1998] have found that a Rocchio classifier can achieve an effectiveness comparable to that of a state-of-the-art ML method such as "boosting" (see Section 6.11.1) while being 60 times quicker to train. These recent results will no doubt bring about a renewed interest in the Rocchio classifier, previously considered an underperformer [Cohen and Singer 1999; Joachims 1998; Lewis et al. 1996; Schütze et al. 1995].

[Fig. 3. A comparison between the TC behavior of (a) the Rocchio classifier and (b) the k-NN classifier. Small crosses and circles denote positive and negative training instances, respectively; the big circles denote the "influence area" of the classifier. For ease of illustration, document similarities are here viewed in terms of Euclidean distance rather than, as is more common, in terms of dot product or cosine.]

6.8. Neural Networks

A neural network (NN) text classifier is a network of units, where the input units represent terms, the output unit(s) represent the category or categories of interest, and the weights on the edges connecting units represent dependence relations. For classifying a test document d_j, its term weights w_kj are loaded into the input units; the activation of these units is propagated forward through the network, and the value of the output unit(s) determines the categorization decision(s). A typical way of training NNs is backpropagation, whereby the term weights of a training document are loaded into the input units, and if a misclassification occurs the error is "backpropagated" so as to change the parameters of the network and eliminate or minimize the error.
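To make this concrete, here is a minimal sketch of the simplest case, a single output unit with no hidden layers (i.e., a linear NN of the logistic-regression kind discussed below), trained by gradient descent; the epoch count, learning rate, and function name are illustrative assumptions:

    import math

    def train_logistic_unit(docs, labels, epochs=10, lr=0.1):
        # docs: list of term-weight vectors; labels: 1 if d_j is in c_i, else 0.
        w = [0.0] * len(docs[0])
        for _ in range(epochs):
            for d, y in zip(docs, labels):
                out = 1.0 / (1.0 + math.exp(-sum(wk * dk for wk, dk in zip(w, d))))
                err = y - out  # error observed at the output unit
                # "Backpropagate" the error to the input weights:
                w = [wk + lr * err * dk for wk, dk in zip(w, d)]
        return w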

The simplest type of NN classifier is the perceptron [Dagan et al. 1997; Ng et al. 1997], which is a linear classifier and as such has been extensively discussed above. Other types of linear NN classifiers, implementing a form of logistic regression, have also been proposed and tested by Schütze et al. [1995] and Wiener et al. [1995], yielding very good effectiveness. A nonlinear NN [Lam and Lee 1999; Ruiz and Srinivasan 1999; Schütze et al. 1995; Weigend et al. 1999; Wiener et al. 1995; Yang and Liu 1999] is instead a network with one or more additional "layers" of units, which in TC usually represent higher-order interactions between terms that the network is able to learn. When comparative experiments relating nonlinear NNs to their linear counterparts have been performed, the former have yielded either no improvement [Schütze et al. 1995] or very small improvements [Wiener et al. 1995] over the latter.

6.9. Example-Based Classifiers

Example-based classifiers do not build an explicit, declarative representation of the category c_i, but rely on the category labels attached to the training documents similar to the test document. These methods have thus been called lazy learners, since "they defer the decision on how to generalize beyond the training data until each new query instance is encountered" [Mitchell 1996, page 244].

Our presentation of the example-based approach will be based on the k-NN (for "k nearest neighbors") algorithm used by Yang [1994]; other examples include Joachims [1998], Larkey [1998; 1999], Li and Jain [1998], and Yang and Liu [1999]. For deciding whether d_j ∈ c_i, k-NN looks at whether the k training documents most similar to d_j are also in c_i; if the answer is positive for a large enough proportion of them, a positive decision is taken, and a negative decision is taken otherwise. Classifying d_j by means of k-NN thus comes down to computing

    CSV_i(d_j) = Σ_{d_z ∈ Tr_k(d_j)} RSV(d_j, d_z) · [[Φ̆(d_z, c_i)]],     (9)

where Tr_k(d_j) is the set of the k documents d_z which maximize RSV(d_j, d_z), and [[α]] = 1 if α = T and
0 if α = F. In (9), RSV(d_j, d_z) represents some measure of semantic relatedness between a test document d_j and a training document d_z; any matching function from a ranked IR system, be it probabilistic (as used by Larkey and Croft [1996]) or vector-based (as used by Yang [1994]), may be used for this purpose. The thresholding methods of Section 6.1 can then be used to convert the real-valued CSV_i's into binary categorization decisions.

The first application of example-based methods (a.k.a. memory-based reasoning methods) to TC is due to Creecy, Masand, and colleagues [Creecy et al. 1992; Masand et al. 1992]. Note that Yang's is a distance-weighted version of k-NN (see [Mitchell 1996, Section 8.2.1]), since the fact that a most similar document is in c_i is weighted by its similarity with the test document.

The construction of a k-NN classifier also involves determining (experimentally, on a validation set) a threshold k that indicates how many top-ranked training documents have to be considered for computing CSV_i(d_j). Larkey and Croft [1996] used k = 20, while Yang [1994; 1999] has found 30 ≤ k ≤ 45 to yield the best effectiveness; various experiments have anyhow shown that increasing the value of k does not significantly degrade the performance. Note that k-NN, unlike linear classifiers, does not divide the document space linearly, and hence does not suffer from the problem discussed at the end of Section 6.7. This is graphically depicted in Figure 3(b), where the more "local" character of k-NN with respect to Rocchio can be appreciated.
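A minimal sketch of the category-status value of Equation (9) follows, assuming cosine similarity as the RSV function and Boolean category labels; all helper names are illustrative:

    import heapq, math

    def cosine(d1, d2):
        dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
        n1 = math.sqrt(sum(w * w for w in d1.values()))
        n2 = math.sqrt(sum(w * w for w in d2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def knn_csv(test_doc, training_set, k=30):
        # training_set: list of (doc_vector, in_category) pairs.
        # CSV_i(d_j) = sum of the similarities of the k nearest neighbors
        # that belong to c_i.
        sims = [(cosine(test_doc, d), in_cat) for d, in_cat in training_set]
        top_k = heapq.nlargest(k, sims)  # the k documents maximizing RSV
        return sum(s for s, in_cat in top_k if in_cat)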

The k-NN method is naturally geared toward document-pivoted TC, since ranking the training documents for their similarity with the test document can be done once for all categories; for category-pivoted TC, one would need to store the document ranks for each test document, which is obviously clumsy. DPC is thus de facto the only reasonable way to use k-NN. A number of different experiments (see Section 7.3) have shown k-NN to be quite effective. However, its most important drawback is its inefficiency at classification time: while with a linear classifier only a dot product needs to be computed to classify a test document, k-NN requires the entire training set to be ranked for similarity with the test document, which is much more expensive. This is a drawback of all "lazy" learning methods, since they do not have a true training phase and thus defer all the computation to classification time.

6.9.1. Other Example-Based Techniques. Various other example-based techniques have been used in the TC literature. For example, Cohen and Hirsh [1998] implemented an example-based classifier by extending standard relational DBMS technology with "similarity-based soft joins." In their WHIRL system they used the scoring function

    CSV_i(d_j) = 1 − Π_{d_z ∈ Tr_k(d_j)} (1 − RSV(d_j, d_z))^[[Φ̆(d_z, c_i)]]

as an alternative to (9), obtaining a small but statistically significant improvement over a version of WHIRL using (9). In their experiments this technique also outperformed a number of other classifiers, such as a C4.5 decision tree classifier and the RIPPER CNF rule-based classifier.

A variant of the basic k-NN approach was proposed by Galavotti et al. [2000], who reinterpreted (9) by redefining [[α]] as [[α]] = 1 if α = T and −1 if α = F. The difference from the original k-NN approach is that if a training document d_z similar to the test document d_j does not belong to c_i, this information is not discarded but weights negatively in the decision to classify d_j under c_i.

A combination of profile- and example-based methods was presented in Lam and Ho [1998]. In this work a k-NN system was fed generalized instances (GIs) in place of training documents. This approach may be seen as the result of

—clustering the training set, thus obtaining a set of clusters K_i = {k_i1, ..., k_i|K_i|};
—building a profile G(k_iz) ("generalized instance") from the documents belonging to cluster k_iz by means of some algorithm for learning linear classifiers (e.g., Rocchio, WIDROW-HOFF);
—applying k-NN with profiles in place of training documents, that is, computing

    CSV_i(d_j) ≝ Σ_{k_iz ∈ K_i} RSV(d_j, G(k_iz)) · (|{d_j ∈ k_iz | Φ̆(d_j, c_i) = T}| / |{d_j ∈ k_iz}|) · (|{d_j ∈ k_iz}| / |Tr|)
              = Σ_{k_iz ∈ K_i} RSV(d_j, G(k_iz)) · |{d_j ∈ k_iz | Φ̆(d_j, c_i) = T}| / |Tr|,     (10)

where |{d_j ∈ k_iz | Φ̆(d_j, c_i) = T}| / |{d_j ∈ k_iz}| represents the "degree" to which G(k_iz) is a positive instance of c_i, and |{d_j ∈ k_iz}| / |Tr| represents its weight within the entire process. This exploits the superior effectiveness of k-NN over linear classifiers (see Figure 3) while at the same time avoiding the sensitivity of k-NN to the presence of "outliers" (i.e., positive instances of c_i that "lie out" of the region where most other positive instances of c_i are located) in the training set. In the experiments of Lam and Ho [1998] this technique outperformed a number of other classifiers.
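As an illustration of these variants, here is a minimal sketch of the WHIRL-style "noisy-or" scoring function given above, where neighbors holds the (similarity, in_category) pairs of the k nearest training documents; the function name is illustrative:

    def whirl_csv(neighbors):
        # CSV_i(d_j) = 1 - product over positive neighbors of (1 - RSV(d_j, d_z)).
        prod = 1.0
        for sim, in_cat in neighbors:
            if in_cat:  # the [[...]] exponent: only neighbors labeled with c_i count
                prod *= (1.0 - sim)
        return 1.0 - prod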

6.10. Building Classifiers by Support Vector Machines

The support vector machine (SVM) method has been introduced in TC by Joachims [1998; 1999] and subsequently used by Drucker et al. [1999], Dumais et al. [1998], Dumais and Chen [2000], Klinkenberg and Joachims [2000], Taira and Haruno [1999], and Yang and Liu [1999]. In geometrical terms, it may be seen as the attempt to find, among all the surfaces σ_1, σ_2, ... in |T|-dimensional space that separate the positive from the negative training examples (decision surfaces), the σ_i that separates the positives from the negatives by the widest possible margin, that is, such that the separation property is invariant with respect to the widest possible translation of σ_i.

This idea is best understood in the case in which the positives and the negatives are linearly separable, in which case the decision surfaces are (|T|−1)-hyperplanes. In the two-dimensional case of Figure 4, various lines may be chosen as decision surfaces. The SVM method chooses the middle element from the "widest" set of parallel lines, that is, from the set in which the maximum distance between two elements in the set is highest. It is noteworthy that this "best" decision surface is determined by only a small set of training examples, called the support vectors. The method described is applicable also to the case in which the positives and the negatives are not linearly separable. Yang and Liu [1999] experimentally compared the linear case (namely, when the assumption is made that the categories are linearly separable) with the nonlinear case on a standard benchmark, and obtained slightly better results in the former case.

[Fig. 4. Learning support vector classifiers. The small crosses and circles represent positive and negative training examples, respectively, whereas lines represent decision surfaces. Decision surface σ_i (indicated by the thicker line) is, among those shown, the best possible one, as it is the middle element of the widest set of parallel decision surfaces (i.e., its minimum distance to any training example is maximum). Small boxes indicate the support vectors.]

As argued by Joachims [1998], SVMs offer two important advantages for TC:

—term selection is often not needed, as SVMs tend to be fairly robust to overfitting and can scale up to considerable dimensionalities;
—no human and machine effort in parameter tuning on a validation set is needed, as there is a theoretically motivated, "default" choice of parameter settings, which has also been shown to provide the best effectiveness.

Dumais et al. [1998] have applied a novel algorithm for training SVMs which brings about training speeds comparable to computationally easy learners such as Rocchio.
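For concreteness, the following is a minimal sketch of training a linear SVM text classifier with the scikit-learn library (an assumption on our part: the survey predates this library, and the toy documents, labels, and the parameter C=1.0 are purely illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_texts = ["wheat prices rise", "crude oil falls", "wheat harvest up"]
    train_labels = [1, 0, 1]  # 1 = document belongs to category c_i

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_texts)  # |T|-dimensional weighted vectors
    clf = LinearSVC(C=1.0)  # finds the maximum-margin separating hyperplane
    clf.fit(X, train_labels)
    print(clf.predict(vectorizer.transform(["wheat price report"])))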

6.11. Classifier Committees

Classifier committees (a.k.a. ensembles) are based on the idea that, given a task that requires expert knowledge to perform, k experts may be better than one if their individual judgments are appropriately combined. In TC, the idea is to apply k different classifiers Φ_1, ..., Φ_k to the same task of deciding whether d_j ∈ c_i, and then combine their outcomes appropriately. A classifier committee is thus characterized by (i) a choice of k classifiers, and (ii) a choice of a combination function. Concerning Issue (i), it is known from the ML literature that, in order to guarantee good effectiveness, the classifiers forming the committee should be as independent as possible [Tumer and Ghosh 1996]. The classifiers may differ for the indexing approach used, or for the inductive method, or both. Within TC, the avenue which has been explored most is the latter (to our knowledge the only example of the former is Scott and Matwin [1999]).

Concerning Issue (ii), various rules have been tested. The simplest one is majority voting (MV), whereby the binary outputs of the k classifiers are pooled together, and the classification decision that reaches the majority of (k+1)/2 votes is taken (k obviously needs to be an odd number) [Li and Jain 1998; Liere and Tadepalli 1997]; this method is particularly suited to the case in which the committee includes classifiers characterized by a binary decision function CSV_i : D → {T, F}. A second rule is weighted linear combination (WLC), whereby a weighted sum of the CSV_i's produced by the k classifiers yields the final CSV_i; the weights w_j reflect the expected relative effectiveness of classifiers Φ_j, and are usually optimized on a validation set [Larkey and Croft 1996]. Another policy is dynamic classifier selection (DCS), whereby among committee {Φ_1, ..., Φ_k} the classifier Φ_t most effective on the l validation examples most similar to d_j is selected, and its judgment adopted by the committee [Li and Jain 1998]. A still different policy, somehow intermediate between WLC and DCS, is adaptive classifier combination (ACC), whereby the judgments of all the classifiers in the committee are summed together, but their individual contribution is weighted by their effectiveness on the l validation examples most similar to d_j [Li and Jain 1998].

Classifier committees have had mixed results in TC so far. Larkey and Croft [1996] have used combinations of Rocchio, Naive Bayes, and k-NN, all together or in pairwise combinations, using a WLC rule. In their experiments the combination of any two classifiers outperformed the best individual classifier (k-NN), and the combination of the three classifiers improved on all three pairwise combinations. These results would seem to give strong support to the idea that classifier committees can somehow profit from the complementary strengths of their individual members. However, the small size of the test set used (187 documents) suggests that more experimentation is needed before conclusions can be drawn.

Li and Jain [1998] have tested a committee formed of (various combinations of) a Naive Bayes classifier, an example-based classifier, a decision tree classifier, and a classifier built by means of their own "subspace method"; the combination rules they have worked with are MV, DCS, and ACC. Only in the case of a committee formed by Naive Bayes and the subspace classifier combined by means of ACC has the committee outperformed, and by a narrow margin, the best individual classifier (for every attempted classifier combination, ACC gave better results than MV and DCS). This seems discouraging, especially in light of the fact that the committee approach is computationally expensive (its cost trivially amounts to the sum of the costs of the individual classifiers plus the cost incurred for the computation of the combination rule). Again, it has to be remarked that the small size of their experiment (two test sets of less than 700 documents each) does not allow us to draw definitive conclusions on the approaches adopted.
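The two simplest combination rules described above can be sketched as follows (a minimal illustration; the weights are assumed to come from a validation set, and all names are ours):

    def majority_vote(decisions):
        # decisions: list of Booleans from k classifiers (k odd).
        return sum(decisions) > len(decisions) // 2

    def weighted_linear_combination(csvs, weights):
        # csvs: the CSV_i(d_j) values output by each classifier;
        # weights: their expected relative effectiveness.
        return sum(w * s for w, s in zip(weights, csvs))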

6.11.1. Boosting. The boosting method [Schapire et al. 1998; Schapire and Singer 2000] occupies a special place in the classifier committees literature, since the k classifiers Φ_1, ..., Φ_k forming the committee are obtained by the same learning method (here called the weak learner). The key intuition of boosting is that the k classifiers should be trained not in a conceptually parallel and independent way, as in the committees described above, but sequentially. In this way, in training classifier Φ_t one may take into account how classifiers Φ_1, ..., Φ_{t−1} perform on the training examples, and concentrate on getting right those examples on which Φ_1, ..., Φ_{t−1} have performed worst.

Specifically, for learning classifier Φ_t, each ⟨d_j, c_i⟩ pair is given an "importance weight" h_ij^t (where h_ij^1 is set to be equal for all ⟨d_j, c_i⟩ pairs^15), which represents how hard it was for classifiers Φ_1, ..., Φ_{t−1} to take a correct decision for this pair. These weights are exploited in learning Φ_t, which will be specially tuned to correctly solve the pairs with higher weight. Classifier Φ_t is then applied to the training documents, and as a result the weights h_ij^t are updated to h_ij^{t+1}; in this update operation, pairs correctly classified by Φ_t will have their weight decreased, while pairs misclassified by Φ_t will have their weight increased. After all the k classifiers have been built, a weighted linear combination rule is applied to yield the final committee.

In the BOOSTEXTER system [Schapire and Singer 2000], two different boosting algorithms are tested, using a one-level decision tree weak learner. The former algorithm (ADABOOST.MH, simply called ADABOOST in Schapire et al. [1998]) is explicitly geared toward the maximization of microaveraged effectiveness, whereas the latter (ADABOOST.MR) is aimed at minimizing ranking loss (i.e., at getting a correct category ranking for each individual document). In experiments conducted over three different test collections, Schapire et al. [1998] have shown ADABOOST to outperform SLEEPING EXPERTS, a classifier that had proven quite effective in the experiments of Cohen and Singer [1999]. Further experiments by Schapire and Singer [2000] showed ADABOOST to outperform, aside from SLEEPING EXPERTS, a Naive Bayes classifier, a standard (nonenhanced) Rocchio classifier, and Joachims' [1997] PRTFIDF classifier.

A boosting algorithm based on a "committee of classifier subcommittees" that improves on the effectiveness and (especially) the efficiency of ADABOOST.MH was presented in Sebastiani et al. [2000]. An approach similar to boosting was also employed by Weiss et al. [1999], who experimented with committees of decision trees, each having an average of 16 leaves (and hence much more complex than the simple "decision stumps" used in Schapire and Singer [2000]), eventually combined by using the simple MV rule; similarly to boosting, a mechanism for emphasizing documents that have been misclassified by previous decision trees is used. Boosting-based approaches have also been employed in Escudero et al. [2000], Iyer et al. [2000], Kim et al. [2000], Li and Jain [1998], and Myers et al. [2000].
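A minimal sketch of the reweighting step follows, under the assumption of an AdaBoost-style exponential update (the exact BOOSTEXTER formulation differs in its details; alpha_t and all names are illustrative):

    import math

    def reweight(weights, correct, alpha_t):
        # weights: h^t for each <d_j, c_i> pair; correct: Booleans recording
        # whether classifier t decided each pair correctly; alpha_t > 0.
        new_w = [w * math.exp(-alpha_t if ok else alpha_t)
                 for w, ok in zip(weights, correct)]
        total = sum(new_w)
        return [w / total for w in new_w]  # h^{t+1}, renormalized to sum to 1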
^15 Schapire et al. [1998] also showed that a simple modification of this policy allows optimization of the classifier based on "utility" (see Section 7.1.3).

6.12. Other Methods

Although in the previous sections we have tried to give an overview as complete as possible of the learning approaches proposed in the TC literature, it is hardly possible to be exhaustive. Some of the learning approaches adopted do not fall squarely under one or the other class of algorithms, or have remained somehow isolated attempts. Among these, the most noteworthy are the ones based on Bayesian inference networks [Dumais et al. 1998; Lam et al. 1997; Tzeras and Hartmann 1993], genetic algorithms [Clack et al. 1997; Masand 1994], and maximum entropy modelling [Manning and Schütze 1999].

7. EVALUATION OF TEXT CLASSIFIERS

As for text search systems, the evaluation of document classifiers is typically conducted experimentally, rather than analytically. The reason is that, in order to evaluate a system analytically (e.g., proving that the system is correct and complete), we would need a formal specification of the problem that the system is trying to solve (e.g., with respect to what correctness and completeness are defined), and the central notion of TC (namely, that of membership of a document in a category) is, due to its subjective character, inherently nonformalizable.

The experimental evaluation of a classifier usually measures its effectiveness (rather than its efficiency), that is, its ability to take the right classification decisions.

7.1. Measures of Text Categorization Effectiveness

7.1.1. Precision and Recall. Classification effectiveness is usually measured in terms of the classic IR notions of precision (π) and recall (ρ), adapted to the case of TC. Precision wrt c_i (π_i) is defined as the conditional probability P(Φ̆(d_x, c_i) = T | Φ(d_x, c_i) = T), that is, as the probability that, if a random document d_x is classified under c_i, this decision is correct. Analogously, recall wrt c_i (ρ_i) is defined as P(Φ(d_x, c_i) = T | Φ̆(d_x, c_i) = T), that is, as the probability that, if a random document d_x ought to be classified under c_i, this decision is taken. These category-relative values may be averaged, in a way to be discussed shortly, to obtain π and ρ, values global to the entire category set. Borrowing terminology from logic, π may be viewed as the "degree of soundness" of the classifier wrt C, while ρ may be viewed as its "degree of completeness" wrt C. As defined here, π_i and ρ_i are to be understood as subjective probabilities, that is, as measuring the expectation of the user that the system will behave correctly when classifying an unseen document under c_i.

These probabilities may be estimated in terms of the contingency table for c_i on a given test set (see Table II). Here, FP_i (false positives wrt c_i, a.k.a. errors of commission) is the number of test documents incorrectly classified under c_i; TP_i (true positives wrt c_i), TN_i (true negatives wrt c_i), and FN_i (false negatives wrt c_i, a.k.a. errors of omission) are defined accordingly.

    Table II. The Contingency Table for Category c_i

                                Expert judgments
    Category c_i                YES        NO
    Classifier     YES          TP_i       FP_i
    judgments      NO           FN_i       TN_i

Estimates (indicated by carets) of precision wrt c_i and recall wrt c_i may thus be obtained as

    π̂_i = TP_i / (TP_i + FP_i),     ρ̂_i = TP_i / (TP_i + FN_i).
For obtaining estimates of π and ρ, two different methods may be adopted:

—microaveraging: π and ρ are obtained by summing over all individual decisions:

    π̂^μ = TP / (TP + FP) = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FP_i),
    ρ̂^μ = TP / (TP + FN) = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FN_i),

where "μ" indicates microaveraging. The "global" contingency table (Table III) is thus obtained by summing over the category-specific contingency tables.

    Table III. The Global Contingency Table

                                Expert judgments
    Category set C              YES                  NO
    Classifier     YES          TP = Σ_i TP_i        FP = Σ_i FP_i
    judgments      NO           FN = Σ_i FN_i        TN = Σ_i TN_i

—macroaveraging: precision and recall are first evaluated "locally" for each category, and then "globally" by averaging over the results of the different categories:

    π̂^M = Σ_{i=1}^{|C|} π̂_i / |C|,     ρ̂^M = Σ_{i=1}^{|C|} ρ̂_i / |C|,

where "M" indicates macroaveraging.

These two methods may give quite different results, especially if the different categories have very different generality. For instance, the ability of a classifier to behave well also on categories with low generality (i.e., categories with few positive training instances) will be emphasized by macroaveraging and much less so by microaveraging. Whether one or the other should be used obviously depends on the application requirements. From now on, we will assume that microaveraging is used; everything we will say in the rest of Section 7 may be adapted to the case of macroaveraging in the obvious way.
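A minimal sketch of the two averaging methods, assuming per-category contingency counts with nonzero denominators (counts maps each category to a (TP, FP, FN) triple; names are illustrative):

    def micro_macro(counts):
        tp = sum(c[0] for c in counts.values())
        fp = sum(c[1] for c in counts.values())
        fn = sum(c[2] for c in counts.values())
        micro_p, micro_r = tp / (tp + fp), tp / (tp + fn)
        # Macroaveraging: average the per-category estimates.
        macro_p = sum(c[0] / (c[0] + c[1]) for c in counts.values()) / len(counts)
        macro_r = sum(c[0] / (c[0] + c[2]) for c in counts.values()) / len(counts)
        return (micro_p, micro_r), (macro_p, macro_r)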

7.1.2. Other Measures of Effectiveness. Measures alternative to π and ρ and commonly used in the ML literature, such as accuracy (estimated as Â = (TP + TN) / (TP + TN + FP + FN)) and error (estimated as Ê = (FP + FN) / (TP + TN + FP + FN) = 1 − Â), are not widely used in TC. The reason is that, as Yang [1999] pointed out, the large value that their denominator typically has in TC makes them much more insensitive to variations in the number of correct decisions (TP + TN) than π and ρ. Besides, if A is the adopted evaluation measure, in the frequent case of a very low average generality the trivial rejector (i.e., the classifier Φ such that Φ(d_j, c_i) = F for all d_j and c_i) tends to outperform all nontrivial classifiers (see also Cohen [1995a], Section 2.3). If A is adopted, parameter tuning on a validation set may thus result in parameter choices that make the classifier behave very much like the trivial rejector.

A nonstandard effectiveness measure was proposed by Sable and Hatzivassiloglou [2000, Section 7], who suggested basing π and ρ not on "absolute" values of success and failure (i.e., 1 if Φ(d_j, c_i) = Φ̆(d_j, c_i) and 0 otherwise), but on values of relative success (i.e., CSV_i(d_j) if Φ̆(d_j, c_i) = T and 1 − CSV_i(d_j) if Φ̆(d_j, c_i) = F). This means that for a correct (respectively, wrong) decision the classifier is rewarded (respectively, penalized) proportionally to its confidence in the decision. This proposed measure does not reward the choice of a good thresholding policy, and is thus unfit for autonomous ("hard") classification systems. However, it might be appropriate for interactive ("ranking") classifiers of the type used in Larkey [1999], where the confidence that the classifier has in its own decision influences category ranking and, as a consequence, the overall usefulness of the system.

7.1.3. Measures Alternative to Effectiveness. In general, criteria different from effectiveness are seldom used in classifier evaluation. For instance, efficiency, although very important for applicative purposes, is seldom used as the sole yardstick, due to the volatility of the parameters on which the evaluation rests. However, efficiency may be useful for choosing among classifiers with similar effectiveness. An interesting evaluation has been carried out by Dumais et al. [1998], who have compared five different learning methods along three different dimensions, namely, effectiveness, training efficiency (i.e., the average time it takes to build a classifier for category c_i from a training set Tr), and classification efficiency (i.e., the average time it takes to classify a new document d_j under category c_i).

An important alternative to effectiveness is utility, a class of measures from decision theory that extend effectiveness by economic criteria such as gain or loss. Utility is based on a utility matrix such as that of Table IV, where the numeric values u_TP, u_FP, u_FN, and u_TN represent the gain brought about by a true positive, false positive, false negative, and true negative, respectively; both u_TP and u_TN are greater than both u_FP and u_FN.

    Table IV. The Utility Matrix

                                Expert judgments
    Category set C              YES        NO
    Classifier     YES          u_TP       u_FP
    judgments      NO           u_FN       u_TN

"Standard" effectiveness is a special case of utility, that is, the one in which u_TP = u_TN > u_FP = u_FN. Less trivial cases are those in which u_TP ≠ u_TN and/or u_FP ≠ u_FN; this is appropriate, for example, in spam filtering,
where failing to discard a piece of junk mail (FP) is a less serious mistake than discarding a legitimate message (FN) [Androutsopoulos et al. 2000]. If the classifier outputs probability estimates of the membership of d_j in c_i, then decision theory provides analytical methods to determine the thresholds τ_i, thus avoiding the need to determine them experimentally (as discussed in Section 6.1). Specifically, as Lewis [1995a] reminds us, the expected value of utility is maximized when

    τ_i = (u_FP − u_TN) / ((u_FN − u_TP) + (u_FP − u_TN)),

which, in the case of "standard" effectiveness, is equal to 1/2.
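A minimal sketch of this decision-theoretic threshold: classify d_j under c_i iff the estimated membership probability exceeds τ_i. The utility values in the examples are illustrative, not from the survey:

    def utility_threshold(u_tp, u_fp, u_fn, u_tn):
        return (u_fp - u_tn) / ((u_fn - u_tp) + (u_fp - u_tn))

    # "Standard" effectiveness (u_tp = u_tn = 1, u_fp = u_fn = 0) gives 0.5:
    print(utility_threshold(1, 0, 0, 1))    # 0.5
    # Penalizing false positives heavily raises the threshold, i.e., makes
    # the classifier more conservative:
    print(utility_threshold(1, -9, 0, 1))   # 10/11, approximately 0.91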

The use of utility in TC is discussed in detail by Lewis [1995a]. Other works where utility is employed are Amati and Crestani [1999], Cohen and Singer [1999], Hull et al. [1996], Lewis and Catlett [1994], and Schapire et al. [1998]. Utility has become popular within the text filtering community, and the TREC "filtering track" evaluations have been using it for a while [Lewis 1995c]. The values of the utility matrix are extremely application-dependent. This means that if utility is used instead of "pure" effectiveness, there is a further element of difficulty in the cross-comparison of classification systems (see Section 7.3), since for two classifiers to be experimentally comparable, the two utility matrices must also be the same.

Other effectiveness measures different from the ones discussed here have occasionally been used in the literature; these include adjacent score [Larkey 1998], coverage [Schapire and Singer 2000], one-error [Schapire and Singer 2000], Pearson product-moment correlation [Larkey 1998], recall at n [Larkey and Croft 1996], top candidate [Larkey and Croft 1996], and top n [Larkey and Croft 1996]. We will not attempt to discuss them in detail. However, their use shows that, although the TC community is making consistent efforts at standardizing experimentation protocols, we are still far from universal agreement on evaluation issues and, as a consequence, from understanding precisely the relative merits of the various methods.

    Table V. Trivial Cases in TC

                                   Precision      Recall         C-precision    C-recall
                                   TP/(TP+FP)     TP/(TP+FN)     TN/(FP+TN)     TN/(TN+FN)
    Trivial rejector   TP=FP=0     Undefined      0/FN = 0       TN/TN = 1      TN/(TN+FN)
    Trivial acceptor   FN=TN=0     TP/(TP+FP)     TP/TP = 1      0/FP = 0       Undefined
    Trivial "Yes"      FP=TN=0     TP/TP = 1      TP/(TP+FN)     Undefined      0/FN = 0
    Trivial "No"       TP=FN=0     0/FP = 0       Undefined      TN/(FP+TN)     TN/TN = 1

7.1.4. Combined Effectiveness Measures. Neither precision nor recall makes sense in isolation from the other.
In fact the classifier Φ such that Φ(d_j, c_i) = T for all d_j and c_i (the trivial acceptor) has ρ = 1. When the CSV_i function has values in [0, 1], one only needs to set every threshold τ_i to 0 to obtain the trivial acceptor; in this case, π would usually be very low (more precisely, equal to the average test set generality (1/|C|) Σ_{i=1}^{|C|} g_Te(c_i)).^16 Conversely, it is well known from everyday IR practice that higher levels of π may be obtained at the price of low values of ρ. In practice, by tuning τ_i, a function CSV_i : D → {T, F} is tuned to be, e.g., more liberal (i.e., improving ρ_i to the detriment of π_i) or more conservative (improving π_i to the detriment of ρ_i).^17 A classifier should thus be evaluated by means of a measure which combines π and ρ.^18 Various such measures have been proposed, among which the most frequent are:

^16 From this, one might be tempted to infer, by symmetry, that the trivial rejector always has π = 1. This is false, as π is undefined (the denominator is zero) for the trivial rejector (see Table V). There is a breakup of "symmetry" between π and ρ here because, from the point of view of classifier judgment (positives vs. negatives), the "symmetric" of ρ (TP/(TP+FN)) is not π (TP/(TP+FP)) but C-precision (π^c = TN/(FP+TN)), the "contrapositive" of π. In fact, while ρ = 1 and π^c = 0 for the trivial acceptor, π^c = 1 and ρ = 0 for the trivial rejector.

^17 While ρ_i can always be increased at will by lowering τ_i, usually at the cost of decreasing π_i, π_i can usually be increased at will by raising τ_i, always at the cost of decreasing ρ_i. This kind of tuning is only possible for CSV_i functions with values in [0, 1]; for binary-valued CSV_i functions tuning is not always possible, or is anyway more difficult (see Weiss et al. [1999], page 66).

^18 An exception is single-label TC, in which π and ρ are not independent of each other: if a document d_j has been classified under a wrong category c_s (thus decreasing π_s), this also means that it has not been classified under the right category c_t (thus decreasing ρ_t). In this case either π or ρ can be used as a measure of effectiveness.

(1) Eleven-point average precision: threshold τ_i is repeatedly tuned so as to allow ρ_i to take up the values 0.0, 0.1, ..., 0.9, 1.0; π_i is computed for these 11 different values of τ_i, and averaged over the 11 resulting values. This is analogous to the standard evaluation methodology for ranked IR systems, and may be used

(a) with categories in place of IR queries. This is most frequently used for document-ranking classifiers (see Schütze et al. [1995], Wiener et al. [1995], Yang [1994], and Yang and Pedersen [1997]);
(b) with test documents in place of IR queries and categories in place of documents. This is most frequently used for category-ranking classifiers (see Lam et al. [1999], Larkey and Croft [1996], Schapire and Singer [2000], and Wiener et al. [1995]). In this case, if macroaveraging is used, it needs to be redefined on a per-document, rather than per-category, basis.

This measure does not make sense for binary-valued CSV_i functions, since in this case ρ_i may not be varied at will.

(2) The breakeven point, that is, the value at which π equals ρ (e.g., Apté et al. [1994], Cohen and Singer [1999], Dagan et al. [1997], Joachims [1998], Lewis [1992a], Lewis and Ringuette [1994], Moulinier and Ganascia [1996], Ng et al. [1997], and Yang [1999]). This is obtained by a process analogous to the one used for 11-point average precision: a plot of π as a function of ρ is computed by repeatedly varying the thresholds τ_i; breakeven is the value of ρ (or π) for which the plot intersects the ρ = π line. This idea relies on the fact that, by decreasing the τ_i's from 1 to 0, ρ always increases monotonically from 0 to 1, while π usually decreases monotonically from a value near 1 to (1/|C|) Σ_{i=1}^{|C|} g_Te(c_i). If for no value of the τ_i's π and ρ are exactly equal, the τ_i's are set to the value for which π and ρ are closest, and an interpolated breakeven is computed as the average of the values of π and ρ.^19
(3) The F_β function [van Rijsbergen 1979, Chapter 7], for some 0 ≤ β ≤ +∞ (e.g., Cohen [1995a], Cohen and Singer [1999], Lewis [1995a], Lewis and Gale [1994], Moulinier et al. [1996], and Ruiz and Srinivasan [1999]), where

    F_β = (β² + 1) π ρ / (β² π + ρ).

Here β may be seen as the relative degree of importance attributed to π and ρ: if β = 0, then F_β coincides with π, whereas if β = +∞, then F_β coincides with ρ. Usually, a value β = 1 is used, which attributes equal importance to π and ρ. As shown in Moulinier et al. [1996] and Yang [1999], the breakeven of a classifier is always less than or equal to its F_1 value.

^19 Breakeven, first proposed by Lewis [1992a; 1992b], has been recently criticized. Lewis himself (see his message of 11 Sep 1997 10:49:01 to the DDLBETA text categorization mailing list; quoted with permission of the author) has pointed out that breakeven is not a good effectiveness measure, since (i) there may be no parameter setting that yields the breakeven (in this case the final breakeven value, obtained by interpolation, is artificial), and (ii) to have ρ equal π is not necessarily desirable, and it is not clear that a system that achieves high breakeven can be tuned to score high on other effectiveness measures. Yang [1999] also noted that, when for no value of the parameters π and ρ are close enough, interpolated breakeven may not be a reliable indicator of effectiveness.
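Minimal sketches of the interpolated breakeven and of the F_β function follow (the degenerate-case conventions and all names are ours):

    def interpolated_breakeven(points):
        # points: (pi, rho) pairs measured at several threshold settings;
        # at the setting where they are closest, return their average.
        pi, rho = min(points, key=lambda pr: abs(pr[0] - pr[1]))
        return (pi + rho) / 2.0

    def f_beta(precision, recall, beta=1.0):
        if precision == 0.0 and recall == 0.0:
            return 0.0  # convention for the degenerate case
        b2 = beta * beta
        return (b2 + 1) * precision * recall / (b2 * precision + recall)

    print(f_beta(0.9, 0.6))            # F_1: equal importance to pi and rho
    print(f_beta(0.9, 0.6, beta=0.5))  # beta < 1 emphasizes precision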

Once an effectiveness measure is chosen, a classifier can be tuned (e.g., thresholds and other parameters can be set) so that the resulting effectiveness is the best achievable by that classifier. Tuning a parameter p (be it a threshold or other) is normally done experimentally. This means performing repeated experiments on the validation set with the values of the other parameters p_k fixed (at a default value, in the case of a yet-to-be-tuned parameter p_k, or at the chosen value, if the parameter p_k has already been tuned) and with different values for parameter p. The value that has yielded the best effectiveness is chosen for p.
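A minimal sketch of this one-parameter-at-a-time tuning procedure, where evaluate is an assumed helper returning the effectiveness (e.g., F_1) of the classifier on the validation set under the given settings:

    def tune(evaluate, candidates, fixed_params):
        # Try each candidate value of p with the other parameters held fixed;
        # keep the value yielding the best validation-set effectiveness.
        best_value, best_score = None, float("-inf")
        for p in candidates:
            score = evaluate(**fixed_params, p=p)
            if score > best_score:
                best_value, best_score = p, score
        return best_value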
7.2. Benchmarks for Text Categorization

Standard benchmark collections that can be used as initial corpora for TC are publicly available for experimental purposes. The most widely used is the Reuters collection, consisting of a set of newswire stories classified under categories related to economics. The Reuters collection accounts for most of the experimental work in TC so far; unfortunately, this does not always translate into reliable comparative results, in the sense that many of these experiments have been carried out in subtly different conditions.

In general, different sets of experiments may be used for cross-classifier comparison only if the experiments have been performed (1) on exactly the same collection (i.e., same documents and same categories); (2) with the same "split" between training set and test set; and (3) with the same evaluation measure and, whenever this measure depends on some parameters (e.g., the utility matrix chosen), with the same parameter values. Unfortunately, a lot of experimentation, both on Reuters and on other collections, has not been performed with these caveats in mind: by testing three different classifiers on five popular versions of Reuters, Yang [1999] has shown that a lack of compliance with these three conditions may make the experimental results hardly comparable among each other.

Table VI lists the results of all experiments known to us performed on five major versions of the Reuters benchmark: Reuters-22173 "ModLewis" (column #1), Reuters-22173 "ModApté" (column #2), Reuters-22173 "ModWiener" (column #3), Reuters-21578 "ModApté" (column #4), and Reuters-21578[10] "ModApté" (column #5).^20 Only experiments that have computed either a breakeven or F_1 have been listed, since other less popular effectiveness measures do not readily compare with these. Note that only results belonging to the same column are directly comparable. In particular, Yang [1999] showed that experiments carried out on Reuters-22173 "ModLewis" (column #1) are not directly comparable with those using the other versions, since the former strangely includes a significant percentage (58%) of "unlabeled" test documents which, being negative examples of all categories, tend to depress effectiveness. Also, experiments performed on Reuters-21578[10] "ModApté" (column #5) are not comparable with the others, since this collection is the restriction of Reuters-21578 "ModApté" to the 10 categories with the highest generality, and is thus an obviously "easier" collection.

A new corpus, called Reuters Corpus Volume 1 and consisting of roughly 800,000 documents, has recently been made available by Reuters for TC experiments (see http://about.reuters.com/researchandstandards/corpus/). This will likely replace Reuters-21578 as the "standard" Reuters benchmark for TC.

^20 The Reuters-21578 collection may be freely downloaded for experimentation purposes from http://www.research.att.com/~lewis/reuters21578.html.

    Table VI. Comparative Results Among Different Classifiers Obtained on Five Different Versions of Reuters. (Unless otherwise noted, entries indicate the microaveraged breakeven point; "M" indicates macroaveraging, "F_1" indicates use of the F_1 measure, and boldface indicates the best performer on the collection.)

                                #1        #2        #3        #4        #5
    # of documents              21,450    14,347    13,272    12,902    12,902
    # of training documents     14,704    10,667     9,610     9,603     9,603
    # of test documents          6,746     3,680     3,662     3,299     3,299
    # of categories                135        93        92        90        10

    [The body of the table, whose cell-by-cell layout is not reliably recoverable here, reports one breakeven (or F_1) entry per system and Reuters version, for: WORD (non-learning) [Yang 1999]; the probabilistic classifiers PROPBAYES [Lewis 1992a], BIM and the classifier of Li and Yamanishi [1999], NB [Yang and Liu 1999], and those of Dumais et al. [1998], Joachims [1998], and Lam et al. [1997]; the Bayesian nets of Dumais et al. [1998] and Lam et al. [1997]; the decision tree classifiers IND [Lewis and Ringuette 1994], C4.5 [Joachims 1998], and that of Dumais et al. [1998]; the decision rule classifiers SWAP-1 [Apté et al. 1994], RIPPER and SLEEPINGEXPERTS [Cohen and Singer 1999], DL-ESC [Li and Yamanishi 1999], and CHARADE [Moulinier and Ganascia 1996; Moulinier et al. 1996]; the LLSF regression method [Yang 1999; Yang and Liu 1999]; the on-line linear classifiers BALANCEDWINNOW [Dagan et al. 1997] and WIDROW-HOFF [Lam and Ho 1998]; the batch linear classifiers ROCCHIO [Cohen and Singer 1999; Joachims 1998; Lam and Ho 1998; Li and Yamanishi 1999] and FINDSIM [Dumais et al. 1998]; the neural networks CLASSI [Ng et al. 1997], NNET [Yang and Liu 1999], and that of Wiener et al. [1995]; the example-based classifiers GIS-W [Lam and Ho 1998] and k-NN [Joachims 1998; Lam and Ho 1998; Yang 1999; Yang and Liu 1999]; the SVM classifiers SVMLIGHT [Joachims 1998; Li and Yamanishi 1999; Yang and Liu 1999] and that of Dumais et al. [1998]; and the committees ADABOOST.MH [Schapire and Singer 2000] and that of Weiss et al. [1999].]

Other test collections that have been frequently used are:

—the OHSUMED collection, set up by Hersh et al. [1994] and used by Joachims [1998], Lam and Ho [1998], Lam et al. [1999], Lewis et al. [1996], Ruiz and Srinivasan [1999], and Yang and Pedersen [1997].^21 The documents are titles or title-plus-abstracts from medical journals (OHSUMED is actually a subset of the Medline document base); the categories are the "postable terms" of the MESH thesaurus;

—the 20 Newsgroups collection, set up by Lang [1995] and used by Baker and McCallum [1998], Joachims [1997], McCallum and Nigam [1998], McCallum et al. [1998], Nigam et al. [2000], and Schapire and Singer [2000]. The documents are messages posted to Usenet newsgroups, and the categories are the newsgroups themselves;

—the AP collection, used by Cohen [1995a; 1995b], Cohen and Singer [1999], Lewis and Catlett [1994], Lewis and Gale [1994], Lewis et al. [1996], Schapire and Singer [2000], and Schapire et al. [1998].

We will not cover the experiments performed on these collections for the same reasons as those illustrated in footnote 20.

^21 The OHSUMED collection may be freely downloaded for experimentation purposes from ftp://medir.ohsu.edu/pub/ohsumed.

7.3. Which Text Classifier Is Best?

The published experimental results, and especially those listed in Table VI, allow us to attempt some considerations on the comparative performance of the TC methods discussed. However, we have to bear in mind that comparisons are reliable only when based on experiments performed by the same author under carefully controlled conditions. They are instead more problematic when they involve different experiments performed by different authors. In this case various "background conditions," often extraneous to the learning algorithm itself, may influence the results. These may include, among others, different choices in preprocessing (stemming, etc.), indexing, dimensionality reduction, and classifier parameter values, but also different standards of compliance with safe scientific practice (such as tuning parameters on the test set rather than on a separate validation set), which often are not discussed in the published papers.

Two different methods may thus be applied for comparing classifiers [Yang 1999]:

—direct comparison: classifiers Φ′ and Φ″ may be compared when they have been tested on the same collection Ω, usually by the same researchers and with the same background conditions. This is the more reliable method;

—indirect comparison: classifiers Φ′ and Φ″ may be compared when (1) they have been tested on collections Ω′ and Ω″, respectively,
typically by different researchers and hence with possibly different background conditions, and (2) one or more "baseline" classifiers Φ̄_1, ..., Φ̄_m have been tested on both Ω′ and Ω″ by the direct comparison method. Test (2) gives an indication of the relative "hardness" of the two collections; using this and the results from Test (1), we may obtain an indication of the relative effectiveness of Φ′ and Φ″. For the reasons discussed above, this method is less reliable.

A number of interesting conclusions can be drawn from Table VI by using these two methods. Concerning the relative "hardness" of the five collections, if by Ω′ > Ω″ we indicate that Ω′ is a harder collection than Ω″, there seems to be enough evidence that Reuters-22173 "ModLewis" > Reuters-22173 "ModWiener" > Reuters-22173 "ModApté" ≈ Reuters-21578 "ModApté" > Reuters-21578[10] "ModApté." These facts are unsurprising; in particular, the first and the last inequalities are a direct consequence of the peculiar characteristics of Reuters-22173 "ModLewis" and Reuters-21578[10] "ModApté" discussed in Section 7.2.

Concerning the relative performance of the classifiers, remembering the considerations above, we may attempt a few conclusions:

—Boosting-based classifier committees, support vector machines, example-based methods, and regression methods deliver top-notch performance. There seems to be no sufficient evidence to decidedly opt for any one of these methods; efficiency considerations or application-dependent issues might play a role in breaking the tie.

—Neural networks and on-line linear classifiers work very well, although slightly

worse than the previously mentioned methods.

—Batch linear classifiers (Rocchio) and probabilistic Naive Bayes classifiers look the worst of the learning-based classifiers. For Rocchio, these results confirm earlier results by Schütze et al. [1995], who had found three classifiers based on linear discriminant analysis, linear regression, and neural networks to perform about 15% better than Rocchio. However, recent results by Schapire et al. [1998] ranked Rocchio along the best performers once near-positives are used in training.

—The data in Table VI is hardly sufficient to say anything about decision trees. However, the work by Dumais et al. [1998], in which a decision tree classifier was shown to perform nearly as well as their top performing system (an SVM classifier), will probably renew the interest in decision trees, an interest that had dwindled after the unimpressive results reported in earlier literature [Cohen and Singer 1999; Joachims 1998; Lewis and Catlett 1994; Lewis and Ringuette 1994].

—By far the lowest performance is displayed by WORD, a classifier implemented by Yang [1999] and not including any learning component.^22

Concerning WORD and no-learning classifiers, for completeness we should recall that one of the highest effectiveness values reported in the literature for the Reuters collection (a .90 breakeven) belongs to CONSTRUE, a manually constructed classifier. However, this classifier has never been tested on the standard variants of Reuters mentioned in Table VI,
and it is not clear [Yang 1999] whether the (small) test set of Reuters-22173 "ModHayes" on which the .90 breakeven value was obtained was chosen randomly, as safe scientific practice would demand. Therefore, the fact that this figure is indicative of the performance of CONSTRUE, and of the manual approach it represents, has been convincingly questioned [Yang 1999].

^22 WORD is based on the comparison between documents and category names, each treated as a vector of weighted terms in the vector space model. WORD was implemented by Yang with the only purpose of determining the difference in effectiveness that adding a learning component to a classifier brings about. WORD is actually called STR in [Yang 1994; Yang and Chute 1994]. Another no-learning classifier was proposed in Wong et al. [1996].

It is important to bear in mind that the considerations above are not absolute statements (if there may be any) on the comparative effectiveness of these TC methods. One of the reasons is that a particular applicative context may exhibit very different characteristics from the ones to be found in Reuters, and different classifiers may respond differently to these characteristics. An experimental study by Joachims [1998] involving support vector machines, k-NN, decision trees, Rocchio, and Naive Bayes showed all these classifiers to have similar effectiveness on categories with ≥ 300 positive training examples each. The fact that this experiment involved the methods which have scored best (support vector machines, k-NN) and worst (Rocchio and Naive Bayes) according to Table VI shows that applicative contexts different from Reuters may well invalidate conclusions drawn on this latter.

Finally, a note about the worth of statistical significance testing. Few authors have gone to the trouble of validating their results by means of such tests. These tests are useful for verifying how strongly the experimental results support the claim that a given system Φ′ is better than another system Φ″, or for verifying how much a difference in the experimental setup affects the measured effectiveness of a system Φ. Hull [1994] and Schütze et al. [1995] have been among the first to work in this direction, validating their results by means of the ANOVA test and the Friedman test; the former is aimed at determining the significance of the difference in effectiveness between two methods in terms of the ratio between this difference and the effectiveness variability across categories, while the latter conducts a similar test by using instead the rank positions of each method within a category. Yang and Liu [1999] defined a full suite of significance tests, some of which apply to microaveraged and some to macroaveraged effectiveness. They applied them systematically to the comparison between five different classifiers, and were thus able to infer fine-grained conclusions about their relative effectiveness. For other examples of significance testing in TC, see Cohen [1995a; 1995b], Cohen and Hirsh [1998], Joachims [1997], Koller and Sahami [1997], Lewis et al. [1996], and Wiener et al. [1995].
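As a simple illustration of the kind of test involved, here is a minimal sketch of a paired sign test comparing two classifiers decision by decision (a simplification of our own, not the exact formulation of Yang and Liu [1999]; a_correct and b_correct are Booleans over the same test decisions):

    from math import comb

    def sign_test_p(a_correct, b_correct):
        # Count the decisions on which exactly one of the two systems is right.
        n_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
        n_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
        n = n_a + n_b
        if n == 0:
            return 1.0
        k = max(n_a, n_b)
        # Two-sided binomial tail under H0 (each system equally likely to win):
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        return min(1.0, 2 * tail)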

8. CONCLUSION

Automated TC is now a major research area within the information systems discipline, thanks to a number of factors:

—Its domains of application are numerous and important, and given the proliferation of documents in digital form they are bound to increase dramatically in both number and importance.

—It is indispensable in many applications in which the sheer number of the documents to be classified and the short response time required by the application make the manual alternative implausible.

—It can improve the productivity of human classifiers in applications in which no classification decision can be taken without a final human judgment [Larkey and Croft 1996], by providing tools that quickly "suggest" plausible decisions.

—It has reached effectiveness levels comparable to those of trained professionals. The effectiveness of manual TC is not 100% anyway [Cleverdon 1984] and, more importantly, it is unlikely to be improved substantially by the progress of research. The levels of effectiveness of automated TC are instead growing at a steady pace, and even if they will likely reach a plateau well below the 100% level, this plateau will probably be higher than the effectiveness levels of manual TC.

One of the reasons why from the early '90s the effectiveness of text classifiers has dramatically improved is the arrival in the TC arena of ML methods that are backed by strong theoretical motivations. Examples of these are multiplicative weight updating (e.g., the WINNOW family, WIDROW-HOFF, etc.), adaptive resampling (e.g., boosting), and support vector machines, which provide a sharp contrast with relatively unsophisticated and weak methods such as Rocchio. In TC, ML researchers have found a challenging application, since datasets consisting of hundreds of thousands of documents and characterized by tens of thousands of terms are widely available. This means that TC is a good benchmark for checking whether a given learning technique can scale up to substantial sizes. In turn, this probably means that the active involvement of the ML community in TC is bound to grow.
The success story of automated TC is also going to encourage an extension of its methods and techniques to neighboring fields of application. Techniques typical of automated TC have already been extended successfully to the categorization of documents expressed in slightly different media, for instance:

—very noisy text resulting from optical character recognition [Ittner et al. 1995; Junker and Hoch 1998]. In their experiments Ittner et al. [1995] have found that, by employing noisy texts also in the training phase (i.e., texts affected by the same source of noise that is also at work in the test documents), effectiveness levels comparable to those obtainable in the case of standard text can be achieved;

—speech transcripts [Myers et al. 2000; Schapire and Singer 2000]. For instance, Schapire and Singer [2000] classified answers given to a phone operator's request "How may I help you?" so as to be able to route the call to a specialized operator according to call type.

Concerning other more radically different media, the situation is not as bright (however, see Lim [1999] for an interesting attempt at image categorization based

on a textual metaphor). The main reason is that capturing real semantic content of nontextual media by automatic indexing is still an open problem. While there are systems that attempt to detect content, for example, in images by recognizing shapes, color distributions, and texture, the general problem of image semantics is still unsolved. A further reason is that natural language, the language of the text medium, admits far fewer variations than the "languages" employed by the other media. For instance, while the concept of a house can be "triggered" by relatively few natural language expressions, such as house, houses, home, housing, inhabiting, etc., it can be triggered by far more images: the images of all the different houses that exist, of all possible colors and shapes, viewed from all possible perspectives, from all possible distances, etc. If we had solved the multimedia indexing problem in a satisfactory way, the general methodology that we have discussed in this paper for text would also apply to automated multimedia categorization, and there are reasons to believe that the effectiveness levels could be as high. This only adds to the common sentiment that more research in automated content-based indexing for multimedia documents is needed.

ACKNOWLEDGMENTS

This paper owes a lot to the suggestions and constructive criticism of Norbert Fuhr and David Lewis. Thanks also to Umberto Straccia for comments on an earlier draft, to Evgeniy Gabrilovich, Daniela Giorgetti, and Alessandro Moschitti for spotting mistakes in an earlier draft, and to Alessandro Sperduti for many fruitful discussions.

REFERENCES

AMATI, G. AND CRESTANI, F. 1999. Probabilistic learning for selective dissemination of information. Inform. Process. Man. 35, 5, 633–654.
ANDROUTSOPOULOS, I., KOUTSIAS, J., CHANDRINOS, K. V., AND SPYROPOULOS, C. D. 2000. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 160–167.
APTÉ, C., DAMERAU, F. J., AND WEISS, S. M. 1994. Automated learning of decision rules for text categorization. ACM Trans. on Inform. Syst. 12, 3, 233–251.
ATTARDI, G., DI MARCO, S., AND SALVI, D. 1998. Categorization by context. J. Univers. Comput. Sci. 4, 9, 719–736.
BAKER, L. D. AND MCCALLUM, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), 96–103.
BELKIN, N. J. AND CROFT, W. B. 1992. Information filtering and information retrieval: two sides of the same coin? Commun. ACM 35, 12, 29–38.
BIEBRICHER, P., FUHR, N., KNORZ, G., LUSTIG, G., AND SCHWANTNER, M. 1988. The automatic indexing system AIR/PHYS. From research to application. In Proceedings of SIGIR-88, 11th ACM International Conference on Research and Development in Information Retrieval (Grenoble, France, 1988), 333–342. Also reprinted in Sparck Jones and Willett [1997], pp. 513–517.
BORKO, H. AND BERNICK, M. 1963. Automatic document classification. J. Assoc. Comput. Mach. 10, 2, 151–161.
CAROPRESO, M. F., MATWIN, S., AND SEBASTIANI, F. 2001. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In Text Databases and Document Management: Theory and Practice, A. G. Chin, ed. Idea Group Publishing, Hershey, PA, 78–102.
CAVNAR, W. B. AND TRENKLE, J. M. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1994), 161–175.
CHAKRABARTI, S., DOM, B., AGRAWAL, R., AND RAGHAVAN, P. 1998a. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. J. Very Large Data Bases 7, 3, 163–178.
CHAKRABARTI, S., DOM, B., AND INDYK, P. 1998b. Enhanced hypertext categorization using hyperlinks. In Proceedings of SIGMOD-98, ACM International Conference on Management of Data (Seattle, WA, 1998), 307–318.
In Proceedings of the 1st International Conference on Autonomous Agents (Marina del Rey. While there are systems that attempt to detect content. 1988).. F. CAROPRESO. 2001. 307–318. 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas. ACM International Conference on Management of Data (Seattle. 1997). 1998). S.. In Proceedings of SIGIR-98. D. 233–251. AND CROFT. B. AND SCHWANTNER. CHAKRABARTI. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization.. The main reason is that natural language. 1999. N. KOUTSIAS. J. In Proceedings of SIGMOD-98. Comput. M. N. V.. K. Assoc. it can be triggered by far more images: the images of all the different houses that exist. color distributions. BAKER. Learning to classify English text with ILP methods. 333–342. March 2002. S. 23rd ACM International Conference on Research and Development in Information Retrieval (Athens. J. AND CRESTANI. Sebastiani ´ APTE. K. Thanks also to Umberto Straccia for comments on an earlier draft. 4. S. 3.. of all possible colors and shapes. This only adds to the common sentiment that more research in automated contentbased indexing for multimedia documents is needed. Australia. H.

5. R. RI. B. Inform. A. Sci. No. Germany. MD. 48–63. 7th ACM International Conference on Information and Knowledge Management (Bethesda. COHEN. U. In Proceedings of LREC-98. 1998). Syst. DRUCKER. N. Y. In Proceedings of SIGIR-00. Experiments on the use of feature selection and negative evidence in automated text categorization. W. Syst. K. S. 103–130. AND HARSHMAN. 1999). A probabilistic learning approach for document indexing. N.. F. 2–3. URENA I I ´ LOPEZ. 528–552. 4th European Conference on Research and ACM Computing Surveys. W. 18. W. AND PFEIFER. 1995). 12. NY. ˜ D´AZ ESTEBAN. DAGAN. Inform. H. Surv. 1995b. Neural Netw. 1. LANDAUER. J. ACM Trans. Greece. FIELD. S. 1997). 10. M. B. T. MASAND. 1989. 1995. J. COHEN. VAPNIK. Mach. AND HIRSH.. 1984).. Vol. M` RQUEZ. In Proceedings of ECML-00. 169–173.. Intell. 3rd International Conference “Recherche d’Information Assistee par Ordinateur” (Barcelona. W. T. AND PAZZANI.. M. 1998. AND SAHAMI. Spain. H. 55– 72.. 2000). M. 1975. 1994. Y. KNORZ. The Netherlands. 195–217. SMITH. AND CAMPBELL.. 2000. 1985).. 1999. . FUHR. IOS Press. March 2002. AND ROTH. G. Joins that generalize: text classification using WHIRL. Mistakedriven learning in text categorization. InI tegrating linguistic resources in an uniform way for text classification tasks. AND WU. W. FRASCONI. 1997. 1999. 8. 13. L. G. Automatic text categorization and its applications to text retrieval. Heidelberg.. R. L. AND WALTZ. F. 1998). A.. 31. M. 4. S. Inform. DUMAIS. P. AND SINGER. AIR/X—a rule-based multistage indexing system for large subject fields. VAN RIJSBERGEN. N. S. P. 124–143. Text categorization for multi-page documents: A hybrid naive Bayes HMM approach.. W. 151–185. 487–497. Syst. HARTMANN. COHEN.. J. W. CA. HECKERMAN. 2000. In Proceedings of SIGIR-84. 1st International Conference on Language Resources and Evaluation (Grenada. AND GALLINARI. 55–63. T. ed. 1998. Contextsensitive learning methods for text categorization. M. FUHR. N.. Spain. SEBASTIANI. 17.. 1990.Machine Learning in Automated Text Categorization Logic Programming. G. In Causal Models and Intelligent Data Management. G. 1998. 25. AND GARC´A VEGA. 2/3 (March–May). 34. UK. 1991. P. In Proceedings of EMNLP-97. G. New directions in text categorization. 2. 256–263. DOMINGOS. Exploiting structural information for text classification on the WWW. DEERWESTER. 391–408. I.” A survey of probabilistic models in information retrieval. ed. AND CHEN. C. S. 92–115. CRESTANI. PLATT. Commun. Inform. Models for retrieval with probabilistic indexing. 4.. 1048–1054. ACM Trans. J. 2nd Conference on Empirical Methods in Natural Language Processing (Providence. Amsterdam. Towards automatic indexing: automatic assignment of controlled-language indexing and classification from free indexing. 9. ACM Comput. FUHR. 207–216. S. ZARAGOZA. Germany. 23rd European Colloquium on Information Retrieval Research (Darmstadt. The Netherlands. SCHWANTNER. 3. A. In Proceedings of RIAO-91. COOPER. . 148–155. Gammerman.. Some inconsistencies and misnomers in probabilistic information retrieval. 1991. T. 6. 1984.. 3rd Symposium on Intelligent Data Analysis (Amsterdam. Soc... M. 11th European Conference on Machine Learning (Barcelona. FUHR. 29. ACM 35. 7th ACM International Conference on Research and Development in Information Retrieval (Cambridge. 30. . 4th International Conference on Knowledge Discovery and Data Mining (New York. SODA. V. C. AND VULLO. DUMAIS. A. L. 
Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine. 12th International Conference on Machine Learning (Lake Tahoe.. KAROV. In Proceedings of CIKM-98. L. A Boosting applied to word sense disambiguation. 43 DUMAIS. France. De Raedt. K. probably. 1998. In Proceedings of ICML-95. N. S. R. ESCUDERO.. FUHR. S.. FUHR. Inductive learning algorithms and representations for text categorization. Syst. “Is this document relevant? . 1991). AND KNORZ.. DENOYER. In Proceedings of IDA-99. 1. In Proceedings of ECIR01. 1. 129–141. D. FORSYTH. 2001. AND TZERAS. Syst. 2000). AND BUCKLEY. LUSTIG. 1st International Conference “Recherche d’Information Assistee par Ordinateur” (Grenoble. LALMAS. 41. H.. In Proceedings of ECDL-00. 1985. G.. AND SIMI. G. Spain. 2001).. ¨ FURNKRANZ. 2000. On the the optimality of the simple Bayesian classifier under zero-one loss. M. In Proceedings of RIAO-85. J. 1197–1204. Hierarchical classification of Web content. 141– 173. J.. 1. Learn. 23rd ACM International Conference on Research and Development in Information Retrieval (Athens. 391–407.. W. A probabilistic model of dictionarybased automatic indexing. FURNAS. D. M. J. Inform. ACM Trans. M. 1997. Man. 246–265. D. 1992. In Proceedings of KDD-98... Indexing by latent semantic indexing. Inform. ACM Trans. 2002. Probabilistic information retrieval as combination of abstraction inductive learning and probabilistic assumptions. DE BUENAGA RODR´GUEZ... HMM-based passage models for document classification and ranking. 606–623. 124–132. Amer. 1999.. J. CREECY. Springer. I. 100–111. 223–248. AND RIGAU. GALAVOTTI. Document. L. Inform. 1998). Retrieval test evaluation of a rule-based automated indexing (AIR/PHYS). IEEE Trans. H. L. D. Process. N. Text categorization and relational learning. 1999.

H. MO. In Proceedings of CAIA-90.-H. 59–68. No. 2/3 (March-May). 1999).. E. Spain. 2000). Inform. JOACHIMS.. ¨ KESSLER. 4. NJ. J. Detecting concept drift with support vector machines. JOACHIMS. Ireland. W. VA. 2000. Document classification by machine: theory and practice. AND SAHAMI. 18. Tcs: a shell for content-based text categorization. K. 23rd ACM International Conference on Research and Development in Information Retrieval (Athens. Text categorization of low quality images. 1998. In Proceedings of ICML-99. 103–105. I. M. P. PEDERSEN. In Proceedings of ICML-94. 1997. Vol. In ACM Computing Surveys. 2000). KOLLER. NUNBERG. M. In Proceedings of COLING-94. B. Internat. SINGER. 1995. In Proceedings of ACL-97. 116–122. ¨ HULL. R. AND SEBASTIANI. K. 1999. 32–38. 10th European Conference on Machine Learning (Chemnitz. 18th ACM International Conference on Research and Development in Information Retrieval (Seattle. In Proceedings of SIGIR-96. D. In Proceedings of SIGIR-00. IYER. L. 415–439. A. H. 1994).. W. AND ZHANG. Portugal. 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas. LEONE.. Method combination for document filtering. HAHN.. AND FUHR. W. ANDERSEN. M. JUNKER. 11.. 1999). In Proceedings of CIKM-99.. 1997). WALKER. 174–193. AND TOKUNAGA. 5th ACM International Conference on Research and Development in Information Retrieval (Berlin. R. T. D. LALMAS. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of SIGIR-94. 167–174. A. AND AHN. 1999. 1997). B. T. D. IWAYAMA. Y.. AND SCHMANDT. Slovenia. 1059–1063. K. AND JOACHIMS. 1999. B. 1994). 1994. 70–77. M. Hierarchically classifying documents using very few words. M. A method for disambiguating word senses in a large corpus. 202–207. T. A probabillistic description-oriented approach for categorising Web documents. AND SCHUTZE. 1998).. Y. 3. P.. 475–482. 1994. 2000. T. N. D. CA. 1996).. 2002. 1994. LEWIS. An experimental evaluation of OCR text representations for learning document classifiers. . Germany. 2000). M. 1998.. 5. ¨ GOVERT. A decision theory approach to optimal automated indexing. 1994). 143–151. Guest editors’ introduction to the special issue on automated text categorization. Mining online text. 11th International Conference on Machine Learning (New Brunswick. 192–201. R. 19th ACM International Conference on Research and Development ¨ in Information Retrieval (Zurich. 1990). D.44 Advanced Technology for Digital Libraries (Lisbon. Inform. AND SCHUTZE. 1973. 34. Text filtering by boosting naive Bayes classifiers. 2. 279–288. In Proceedings of SIGIR-94. 487–494.-Y. 1995). Transductive inference for text classification using support vector machines. 301–315. 2nd International Conference on Recent Advances in Natural Language Processing (Tzigov Chark. O. A theory of relevance for automatic document classification. 1982.. 17th International Conference on Machine Learning (Stanford. T. 26. 17th ACM International Conference on Research and Development in Information Retrieval (Dublin. 1993. J. Computerassisted indexing. KLINKENBERG.. T. D. ACM 42. In Proceedings of CIKM-00. Cluster-based text categorization: a comparison of category search strategies. A. In Proceedings of SIGIR-82. JOHN.-T. 282–289. Greece. KIM. J. E. 1990. 1994. HERSH. A.. Bulgaria. AND YAROWSKY. HEAPS. Inform. 1971. 273–281. Ireland. A. 121–129. NIRENBURG. AND GUTHRIE. C. J. 1996. OHSUMED: an interactive retrieval evaluation and new large text collection for research. W. F.. D. 
16th International Conference on Machine Learning (Bled. HULL. 1994). KNIGHT.. J. 320–326. ITTNER. TN. 8th ACM International Conference on Information and Knowledge Management (Kansas City. D. J. Syst. March 2002. Commun.. AND HARLEY. Exploiting thesaurus knowledge in rule induction for text classification. T. 168–175. Automatic detection of text genre. D. 268–278. N. D. Comput. L. 137–142. Document Analysis and Recognition 1. BUCKLEY. JUNKER. Germany.. 1997. CA. NV. 1995. JOACHIMS. In Proceedings of SIGIR-95. GUTHRIE. A. In Proceedings of ICML-00. SCHAPIRE. 1997. Intell. In Proceedings of ECML-98. 2000. Irrelevant features and the subset selection problem. 2000). 1995).. H. J.. AND PFLEGER. GALE. R. Text categorization with support vector machines: learning with many relevant features. Boosting for document routing. AND ABECKER. Control 22. JOACHIMS. Storage Retrieval 7. S. G. 1. AND HICKMAN. KNORZ. H. Human. 1997.. Switzerland. KOHAVI. A. D. 9th ACM International Conference on Information and Sebastiani Knowledge Management (McLean... In Proceedings of RANLP-97. 14th International Conference on Machine Learning (Nashville. M. G. A. 200–209. R. CHURCH.. GRAY. 35th Annual Meeting of the Association for Computational Linguistics (Madrid. 6th IEEE Conference on Artificial Intelligence Applications (Santa Barbara. WA. 1982). 58–61. In Proceedings of SDAIR-95. HAYES. D. AND SINGHAL. 15th International Conference on Computational Linguistics (Kyoto. Japan. G. In Proceedings of ICML-97. 17th ACM International Conference on Research and Development in Information Retrieval (Dublin. 1997). D. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. LEWIS. AND HOCH..

. LIDDY. Text classification using ESC-based stochastic decision lists. M. AND JAIN. In Proceedings of SDAIR-94. D. In Proceedings of DL-99. L. 1999. In Proceedings of SIGIR-96.. L. 1998). K. LEWIS. 1996). D. 15th International Joint Conference on Artificial Intelligence (Nagoya. 1995). AND HO. Inform. S. 231. Foundations of Statistical Natural Language Processing. Using a generalized instance set for automatic text categorization. AND SCHUTZE. A sequential algorithm for training text classifiers: corrigendum and additional data. LAM. In Proceedings of DASFAA-99. 1995. S. D. 1999. Automatic essay grading using text categorization techniques. 139–145. 37–50. D. 1994). S. 1992). H. 1992a. D. 1997).. 8. Classification of text documents. Wiley Computer Publishing. 289–297. 81–93. 246– 254. 1994. 45 LEWIS. AND YU. K. 298–306. 1994). S. LEWIS. L. R. D. R. Syst. WA. 6th IEEE International Conference on Database Advanced Systems for Advanced Application (Hsinchu. C. LI.. 1995). D. A. 865–879. In Proceedings of AAAI-97. B. F. W. Naive (Bayes) at forty: The independence assumption in information retrieval. 11th International Conference on Machine Learning (New Brunswick. See also Lewis [1995b]. P.. 179–187. An evaluation of phrasal and clustered representations on a text categorization task. J. In Proceedings of CIKM-99. LARKEY. R. ¨ MANNING. 148–156. Australia. TN. Data Engin. Taiwan. Denmark. Syst. 6. K.. 1994. 331–339. D. 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne. Training algorithms for linear text classifiers. Y. D. Switzerland. 1996. Learnable visual keywords for image classification. 1999). A patent search and classification system. AND RINGUETTE. Switzerland. Amherst. 12. D. 591–596. W. The TREC-4 filtering track: description and analysis. LEWIS. 1997. P. W. In Proceedings of SIGIR-95. 90–95. D. Department of Computer Science. AND SRINIVASAN. Guest editorial for the special issue on text categorization. NJ. Australia. AND HAYES. Combining classifiers in text categorization. Comput. LEWIS. J. 1999. SIGIR Forum 29. P. 19th ACM International Conference on Research and Development in Information ¨ Retrieval (Zurich. 1997). Vol. LIM. A comparison of two learning algorithms for text categorization. MA. SCHAPIRE. J. KORFHAGE. thesis. LEWIS. A sequential algorithm for training text classifiers. 1998. 4th ACM Conference on Digital Libraries (Berkeley. 1998). In Proceedings of ICML-95. LOW. AND GALE. In Proceedings of DL-99. Using a Bayesian network induction approach for text categorization. LAM. 1994.Machine Learning in Automated Text Categorization Proceedings of ICML-97. 11. 1995a. LEWIS. LAM. 1995c. Cambridge. LARKEY. In Proceedings of SIGIR-92. 8th ACM International Conference on Information and Knowledge Management (Kansas City. A. E. W. March 2002. 1994). New York. 4th Text Retrieval Conference (Gaithersburg. 81–89. In Proceedings of IJCAI-97. Japan. CA. 1997.. LEWIS. In Proceedings of ECML-98. 3–12. In Proceedings of SIGIR-98. 1998). AND YAMANISHI. MO. Evaluating and optmizing autonomous text classification systems.. 18th ACM International Conference on Research and Development in Information Retrieval (Seattle. 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas. Active learning with committees for text categorization. 1995b. R. 1. 1999. LIERE. In Proceedings of SIGIR-96. AND CROFT. Y. In Proceedings of SIGIR-94. 14th Conference of the American Association for Artificial Intelligence (Providence. R. 
19th ACM International Conference on Research and ¨ Development in Information Retrieval (Zurich. D. RI. Inform. NV. D. AND CATLETT. LEWIS. ACM Trans. AND TADEPALLI. 1995). MA. ACM Trans. E. University of Massachusetts. 745–750. 1999. . 122–130. 3. LAM. 165–180. W. 12th International Conference on Machine Learning (Lake Tahoe. 1996). L. 1996. D. H. AND LEE. 13–19. C. NY. Y. E. 12. 4th ACM Conference on Digital Libraries (Berkeley. P. S. 4–15. IEEE Trans. Ph. Representation and Learning in Information Retrieval. H. 1994. 278–295. D. LEWIS. LI. 2. CA. 1998. 170–178. 1998. CALLAN. C. 1999. H. Text categorization for multiple users based on semantic features from a machine-readable dictionary.. 195–202. PAIK. 3. LARKEY. K. L. MIT Press. 14th International Conference on Machine Learning (Nashville. J. D. 1999). Ireland. ACM Computing Surveys. NEWSWEEDER: learning to filter netnews. D. Feature reduction for neural network based text categorization. W. AND PAPKA. 15th ACM International Conference on Research and Development in Information Retrieval (Copenhagen. Automatic text categorization and its applications to text retrieval. 537–546. D. 1999). In Proceedings of ICML-94. LANG. D. In Proceedings of SIGIR-98. M. 1998. 1994. 1999). In Proceedings of TREC-4. Information Storage and Retrieval. 1992b. 1997). MD. 41. No. J. AND HO. CA. 17th ACM International Conference on Research and Development in Information Retrieval (Dublin. 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne. D. 1997. D. D. 10th European Conference on Machine Learning (Chemnitz. 34. E. Heterogeneous uncertainty sampling for supervised learning. D. RUIZ. D. D. Germany. Knowl. LEWIS.

Learn. Cambridge. D. Heidelberg. WI. ACM 18. E. Statistical. SINGH. 3. In Advances in Genetic Programming. Man. AND LEE.. 103–134. Feature subset selection in text learning. NIGAM. ROSENFELD. McGraw Hill. Germany. Vol. 1984. M. Text classification from labeled and unlabeled documents using EM. Denmark. MERKL. pp. Mach. Chapter 21. 1/3. Information Extraction. I. Wermter. In Proceedings of AAAI-98. T.. In Proceedings of ECML-98. 1994. the Seventh Electrotechnical and Computer Science Conference (Ljubljana. ´ MLADENIC. Employing EM in pool-based active learning for text classification. Text classification with selforganizing maps: Some lessons learned. AND WALKER. E. 404–417. 39. Inform. M. SARACEVIC. D. SALTON. Libr. Information extraction as a basis for high-precision text classification. 359–367. AND YANG. M. 1998). 1975. 613–620. Soc. T. THRUN. Process. B. B. Amer. J. S. A. RUIZ. 6. AND WALTZ. D. 1988.. E. 130–136. 343–354. Vol. K. G. 806–813. 1998. Also reprinted in Sparck Jones and Willett [1997]. 2/3. R. K. In Proceedings of SIGIR-92. AND HATZIVASSILOGLOU. GOH. Schaler. MASAND. K. 1997. W. 59–65.-H. Also reprinted in Sparck Jones and Willett [1997]. 1976.. Term-weighting approaches in automatic text retrieval. Springer Verlag. Learning to resolve natural language ambiguities: a unified approach. 2000. 1998. Automatic indexing: an experimental inquiry. 3. AND BUCKLEY. WI. Applying an existing machine learning algorithm to text categorization. L. Kinnear. 40. H. MITCHELL. AND HARDING. WONG. Feature selection. Soc. AND GROBELNIK. AND SINGER.. 2000). J. W. AND MITCHELL. Also reprinted in Willett [1988]. J. D. Germany... 145–148. T. RILOFF. G. J. 1996. Comput. pp. M. pp. 10th European Conference on Machine Learning (Chemnitz.-G. ed. A vector space model for automatic indexing. M. 513–523. KEARNS. P. MYAENG. D. S. ed. I. Assoc. M. 281–282. 323–328. WI. Machine Learning. Dig.. 273–280. E. AND GANASCIA. SABLE. Slovenia.. 459–476. 26. No. 2000.. Little words can make a big difference for text classification. 2/3. A. 1996. 1998. In Proceedings of ICML-98. ROTH. pp. Inform. K. MCCALLUM. 24. Text categorization: a symbolic approach.-G. Germany. and G. Heidelberg. 87–99. 15th ACM International Conference on Research and Development in Information Retrieval (Copenhagen. 15th International Conference on Machine Learning (Madison. T. E. 2000. In Proceedings of SDAIR-96. H. S.. New York. perceptron learning. 1998). 143–165. Y. LINOFF. Springer. Hierarchical neural networks for text categorization. Optimising confidence of text classification by evolution of symbolic expressions. 18th ACM International Conference on Research and Development in Information Retrieval (Seattle. B. MYERS. 321–343. 1992.. Classifying news stories using memory-based reasoning. P. Greece. 2000. MOULINIER. AND NIGAM. 3. A. 350–358. 1. 1995. Probabilistic automatic indexing by learning from human indexers. AND NG. Inform. Document. K. 1992). Internat.. 12. MITCHELL. Textbased approaches for non-topical image categorization. . Neurocomputing 21. M. 39. 1996. 11. ˘ MOULINIER. 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley. ROBERTSON. K. K. Improving text classification by shrinkage in a hierarchy of classes. J. Learn. E. PAZIENZA. RILOFF. MCCALLUM. 20th ACM International Conference on Research and Development in Sebastiani Information Retrieval (Philadelphia. 1998). 5th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas. 1998). 8. 
In Proceedings of SIGIR-95. Sci. NG. C. 15th Conference of the American Association for Artificial Intelligence (Madison. 296–333. 23rd ACM International Conference on Research and Development in Information Retrieval (Athens. 1999. C. M. In Proceedings of SIGIR-00. 3. AND LOW. 264–271. 1994. AND SPARCK JONES. Commun. 1998. SALTON. M.. K. 1998. BoosTexter: a boosting-based system for text categorization. Also reprinted in Sparck Jones and Willett [1997].-J. 34. A. ACM Trans. Mach. 2000. C. L. March 2002. Word sequences as features in text-learning.. In Connectionist. A. T. G. In Proceedings of SIGIR-99. SCHAPIRE. S. 143–160. R. 27. Syst. MCCALLUM. 1996). CA. Inform. E. 1961. CA. 1975. 3. K. MA. Amer. Mach. 1998). 2000). 1997. 67–73. Lecture Notes in Computer Science. WA. PA. 1999). 261–275. E. 129–146. 61–77. Relevance: a review of and a framework for the thinking on the notion in information science. Y. 655– 662.. G. NY. In Proceedings of ICML-98. V. RASKINIS. MASAND. J.. NV. S. 1998. Sci. Relevance weighting of search terms. AND LEHNERT. 95–100. 1299.. 15th International Conference on Machine Learning (Madison..46 MARON. J. In Proceedings of ERK-98. MIT Press. A boosting approach to topic spotting on subdialogues. ´ MLADENIC. AND SRINIVASAN. In Proceedings of ICML-00. H. ACM Computing Surveys. A practical hypertext categorization method using links and incrementally available class information. Riloff. OH. 4. AND GANASCIA. 1995). In Proceedings of SIGIR-97. 17th International Conference on Machine Learning (Stanford. and Symbolic Approaches to Learning for Natural Language Processing. 264–270. and a usability case study for text categorization. S.. A. eds. M. 1997). 5. 135–168. T. ROBERTSON.

Learning routing queries in a query zone. X. R. Inform. 1977. O. Vol. Butterworths. 18th ACM International Conference on Research and Development in Information Retrieval (Seattle. E. In Proceedings of CIKM-98. 20th ACM International Conference on Research and Development in Information Retrieval (Philadelphia. AND BUCKLEY. 23rd European Colloquium on Information Retrieval Research (Darmstadt. AND WILLETT. AND TISHBY. 32. 1997). 1999. J. SINGHAL. March 2002. 1999. 12. C. 229–237. 1999). Feature selection in SVM text categorization. 1999). Y. 1998. F. TUMER. R. 106–119. Automatic word sense discrimination. In Proceedings of SIGIR-94. Retr. 13–22. A. 1988. AND HARTMANN. Error correlation and error reduction in ensemble classifiers. O. A. In Proceedings of ICML-99. In Proceedings of SDAIR-95.. In Proceedings of CIKM-00. CA. 1993). 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley. CA. VAN RIJSBERGEN. J. YANG. SLATTERY. Man. AND CHUTE. C. AND PEDERSEN. A study of approaches to hypertext categorization. M.. TAIRA. S. 1. A. Expert network: effective and efficient learning from human decisions in text categorisation and retrieval.gla. WILLETT. 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas. F. ¨ SCHUTZE. 3-4. Retr. Intell. 219–241. An evaluation of statistical approaches to text categorization. J. 24. San Mateo. 2nd ed. 1998). A. 42–49. T. Morgan Kaufmann. 17th ACM International Conference on Research and Development in Information Retrieval (Dublin. 2002. Information Retrieval. J. PA.. Document. A. Automatic indexing based on Bayesian inference networks. 1995). Sci. In Proceedings of SIGIR-95. 97–124. 1995. S. SINGER. MD.. AND LIU.. In Proceedings of SIGIR-97. M. SPARCK JONES. UK. KAN. An example-based mapping method for text categorization and retrieval.. D. S. An improved boosting algorithm and its application to automated text categorization. 156–160. S. MITRA. M. 193–216. AND GHANI. 412–420. JOHNSON. AND HAMPP. GOETZ. 22–34. 7th ACM International Conference on Information and Knowledge Management (Bethesda. 1998. AND PEDERSEN. Inform. G. 1993. 3. J. E. 34.. 619–633. S. 16th International Conference on Machine Learning (Bled. 2/3 (March-May). 1999. Document Retrieval Systems. In Proceedings of AAAI-99. PA... SLONIM. 2000). 5. 317–332.uk/Keith. ´ WEISS. 2–4. J. 16th Conference of the American Association for Artificial Intelligence (Orlando. 480–486. TN. J. 1995). WEIGEND. 14.. 2... VA. London. IEEE Intell. 1999. A. S. J. Germany. AND YOUNG. O. 215–223. 385–403. Y. A neural network approach to topic spotting. Inform. AND BUCKLEY.. VAN 47 RIJSBERGEN. OLES. D. 1996. 1994. H. 1999.. KOK. Inform... Y.Machine Learning in Automated Text Categorization SCHAPIRE. A comparison of classifiers and document representations for the routing problem. SCOTT. I. 16th ACM International Conference on Research and Development in Information Retrieval (Pittsburgh. AND SPRINKHUIZEN-KUYPER. H. Process. Inform. TAURITZ.. 9th ACM International Conference on Information and Knowledge Management (McLean. F. A theoretical basis for the use of co-occurrence data in information retrieval. L. D. Syst.. 1994). M. 2000. Connection Sci. revised February 2001. J. Received December 1999. eds. 256–263. WONG. 2001. ACTION: automatic classification for full-text documents.dcs. TZERAS. 1998). Syst. D. Y... Australia. K. 1998. Available at http://www. P. In Proceedings of SIGIR-98. Y. In Proceedings of SIGIR-93. Readings in Information Retrieval. 3. 
SINGHAL. C. 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne. Slovenia. G. O.. K. Maximizing text-mining performance. YANG. P. ed. 121–140. AND PEDERSEN. In Proceedings of SIGIR-99. 8. London. G. PEDERSEN. 1997. S. YANG. 1996. 379–388. UK. N.. J. A new on-line learning algorithm for adaptive text filtering. In Proceedings of SIGIR-95. MITRA. Syst. 1. The power of word clusters for text classification. YANG. ¨ SCHUTZE.. YANG. 78–85. FL. D. K. Document length normalization. 2000. 18. 26–41. K. WIENER. N. AND GHOSH. AND MATWIN. 1999. E. YANG. G. Computat. 63–69. APTE. ACM Trans. 1. 2001). Taylor Graham. DAMERAU.. T. 33.. YU. Y. 1. 1997. 1997). W.ac. N. J. 1999). AND HARUNO. Exploiting hierarchy in text catagorization. In Proceedings of ICML-97. Noise reduction in a statistical approach to text categorization. No. In Proceedings of ECIR-01. AND WEIGEND. A comparative study on feature selection in text categorization. 252–277.. 14th International Conference on Machine Learning (Nashville. A. E. HULL. . W. WA. Inform. C. 18th ACM International Conference on Research and Development in Information Retrieval (Seattle. Ireland. 1979. N. R. 1995. 122. C. NV.. 1996. SEBASTIANI. 1997. 69–90. SPERDUTI. Y. 1995). accepted July 2001 ACM Computing Surveys. C. 1994. 1995. 25–32.. SIGIR Forum 30. Y. AND LAM. WIENER. Boosting and Rocchio applied to text filtering. YANG. 1–2. Feature engineering for text classification. 1. AND VALDAMBRINI. SALTON.. J. H. Ling. AND SINGHAL.-K. W. H. WA. J. 4. Adaptive information filtering using evolutionary computation. A re-examination of text categorization methods.
