
A Communication Perspective on Automatic Text Categorization

Marta Capdevila and Oscar W. Márquez Flórez, Member, IEEE

Abstract—The basic concern of a Communication System is to transfer information from its source to a destination some distance away. Textual documents also deal with the transmission of information. Particularly, from a text categorization system point of view, the information encoded by a document is the topic or category it belongs to. Following this initial intuition, a theoretical framework is developed where Automatic Text Categorization (ATC) is studied under a Communication System perspective. Under this approach, the problematic dimensionality of the indexing feature space is tackled by a two-level supervised scheme, implemented by a noisy terms filter and a subsequent redundant terms compressor. Gaussian probabilistic categorizers have been revisited and adapted to the sparsity that is concomitant with ATC. Experimental results on the 20 Newsgroups and Reuters-21578 collections validate the theoretical approaches. The noise filter and redundancy compressor allow an aggressive term vocabulary reduction (reduction factor greater than 0.99) with a minimal loss (lower than 3 percent) and, in some cases, a gain (greater than 4 percent) in final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by the state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs).

Index Terms—Data communications, text processing, data compaction and compression, clustering, classifier design and evaluation,

feature evaluation and selection.


1 INTRODUCTION

A deep parallelism may be established between a Communication System and an Automatic Text Categorization (ATC) scheme, since both disciplines deal with the transmission of information and its reliable recovery. The establishment of this novel simile allows us to tackle, from a well-founded communication-theoretic point of view, the over-dimensioned document representation space that is heavily redundant with respect to the classification task and typically proves problematic for many categorizers [1] in ATC.^1 The main objective of our research has been to investigate how, and to what extent, the document representation space can be compressed, and what the effects of this compression are on final classification. The underlying idea is to take a first step toward an optimal encoding of the category, carried by the document vectorial representation, with a view to both limiting the greedy use of resources caused by the high-dimensionality feature space and reducing the effects of overfitting.^2

Additionally, our research also aims at showing how the document decoding (or classification task) can take advantage of the common Gaussian assumptions made in the Communication System discipline but largely ignored in ATC.

This paper is structured as follows: In Section 2, ATC is

briefly reviewed, and in Section 3, the Communication

systems perspective is established. Sections 4 and 5 explain

the theoretical basis of the proposed Document Sampling

and Document decoding. Sections 6, 7, 8, and 9 are

dedicated to experimental results, and finally, Section 10

presents the conclusions.

2 AUTOMATIC TEXT CATEGORIZATION

ATC is the task of assigning a text document to one or more predefined categories or classes,^3 based on its textual content. It corresponds to a supervised (not fully automated) process, where categories are predefined by some external mechanism (normally human) that establishes, at the same time, a set of already labeled examples that form the training set. Classifiers are generated from those training examples, by induction, in the so-called learning phase. This forms the machine learning paradigm (as opposed to the knowledge-engineering approach) of ATC, predominant since the exponential universalization of electronic textual information in the 1990s [1].

It is further generally assumed that categories are

exclusive (also known as nonoverlapping), meaning that a

document can only belong to a single category (single-label

categorization), as this scenario has been shown [1] to be

more general than the multilabel case.



1. A sound exception can be made for the state-of-the-art SVM categorizer, which arguably [2] is well adapted to the typical high-dimensionality representation space of ATC and which benefits from the improved performance of recent training algorithms [3].
2. In general terms, the problem of overfitting results from characterizing reality with too many parameters, which makes the modeling too specific and poorly generalizable.
3. The designations categorization and classification are used interchangeably in this text, both indicating the described supervised process.


2.1 Document Vectorial Representation

The first step toward any compact document representation

is the definition of the indexing features. The indexing features,

also called terms, are the minimal meaningful constitutive

units (a common choice is to use words). The set of different

terms that appear in the collection of training documents

forms the vocabulary or alphabet of terms. Once the alphabet is chosen, the text document can be represented in the term space. In this indexing process, the sequentiality

or order of terms in the text is commonly lost. This is known

as the bag-of-words approach [1].
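As a concrete illustration of this indexing process, the following sketch builds TF bag-of-words vectors over a toy vocabulary. It is a minimal example under simplified assumptions (whitespace tokenization, no stopword or frequency prefiltering); the function names are illustrative, not part of the paper's implementation.

```python
from collections import Counter

def build_vocabulary(training_docs):
    """Collect the distinct terms of the training collection (the alphabet of terms)."""
    terms = {term for doc in training_docs for term in doc.lower().split()}
    return {term: idx for idx, term in enumerate(sorted(terms))}

def tf_vector(doc, vocab):
    """Index a document in the term space with term-frequency (TF) weights.
    Term order is discarded: this is the bag-of-words assumption."""
    counts = Counter(term for term in doc.lower().split() if term in vocab)
    vector = [0] * len(vocab)
    for term, freq in counts.items():
        vector[vocab[term]] = freq
    return vector

docs = ["the market rallied today", "the team won the match"]
vocab = build_vocabulary(docs)
print(tf_vector("the match the market", vocab))
```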

The problem is that the indexing vocabulary typically reaches tens or hundreds of thousands of terms. Working in such a high-dimensionality space commonly proves problematic. This is why, before initiating any classification task, a filtering designed to reduce the term space dimensionality is usually applied. There are basically two approaches to

dimensionality reduction: 1) Term selection, where a subset of

features is selected out of the original set and 2) Term

extraction, where chosen features are obtained by combina-

tion of the original features. In the latter approach,

Distributional Clustering [4], [5], [6], [7] is a supervised

clustering technique that has been shown to be very

effective at reducing the document indexing space with

residual loss in categorization accuracy.

2.2 Common Categorizers

In the following, we briefly review two of the state-of-the-art classifiers used in ATC, which will be extensively referenced in our experiments.

2.2.1 Multinomial Naive Bayes (MNB)

MNB is a probabilistic categorizer that assumes a document is a sequence of terms, each of them randomly chosen from the term vocabulary, independently from the rest of the term events in the document. Despite its oversimplified Naive Bayes basis, MNB achieves good performance in practice [8].
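For concreteness, here is a minimal sketch of MNB training and decoding under the stated term-independence assumption, with Laplace smoothing. This is an illustrative toy implementation, not the Weka categorizer used in the experiments (Section 6.3).

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels):
    """Estimate log priors p(c) and Laplace-smoothed log term likelihoods p(t|c)."""
    term_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc.lower().split())
    vocab = {t for counter in term_counts.values() for t in counter}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        log_like[c] = {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return log_prior, log_like

def classify_mnb(doc, log_prior, log_like):
    """Score log p(c) + sum_t log p(t|c) over the term sequence; unseen terms are skipped."""
    return max(log_prior, key=lambda c: log_prior[c]
               + sum(log_like[c].get(t, 0.0) for t in doc.lower().split()))
```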

2.2.2 Support Vector Machines (SVMs)

SVM is a binary classifier that attempts to find, among all

the surfaces that separate positive from negative training

examples, the decision surface that has the widest possible

margin (the margin being defined as the smallest distance from positive and negative examples to the decision surface). SVM is particularly well adapted to ATC [2] and stands as one of the best-performing categorizers [9].

3 A COMMUNICATION INTERPRETATION ON ATC

A communication system [10] has the basic function of

transferring information (i.e., a message) from a source to

a destination. There are mainly three essential parts of any

communication system: the encoder/transmitter, the trans-

mission channel, and the receiver/decoder. The encoder/

transmitter processes the source message into the encoded

and transmitted messages. The transmission channel is the

medium that bridges the distance from source to destina-

tion. Every channel introduces some degree of undesirable

effects such as attenuation, noise, interference, and distor-

tion. The receiver/decoder processes the received message in order to deliver it to the destination. A classical

digital communication system simplified model is repre-

sented in Fig. 1.

In its raw form, a text document is a string of characters. Typically in ATC, a bag-of-words approach is adopted, which assumes that the document is an order-ignored sequence of words^4 that can be represented vectorially. It is further assumed that the vocabulary used by a given document depends on the category or topic it belongs to. The ATC scheme can be modeled by a communication system, as shown in Fig. 2.

3.1 The Encoder/Transmitter Model

The generation of a document is determined by a Category

encoder, which is a random selector of words, modulated by

the category C (i.e., the selection of words is a random event

different and characteristic of each category). For each

category input $c_i$, the Category encoder is characterized by a distinct alphabet $\mathcal{T}_i$ (which is the subset of $\mathcal{T}$ that contains the words used by the documents of $c_i$) and by the conditional probabilities of each element of this alphabet, $\{p(t_1 \mid c_i), p(t_2 \mid c_i), \ldots, p(t_j \mid c_i), \ldots\}$. Fig. 3 illustrates different example alphabets $\mathcal{T}_i$.

Actually, the Category encoder generates a sequence of outcomes^5 that are the actual words that (partially) form the document. In communication nomenclature, each word could be a symbol.


Fig. 1. A classical digital communication system simplified model.

Fig. 2. ATC modeled by a communication system.

4. Words and terms are used interchangeably in this text to designate the meaningful units of language.
5. The outcomes may be considered independent, as in the Naive Bayes approaches.

Note that the length of the sequence is random (i.e., not fixed, since some documents are short and others long) but presumably category independent. And, finally, let us indicate that the input $c_i$ is itself the value of the outcome of another random event, the category $C$, characterized by an alphabet $\mathcal{C} = \{c_1, c_2, \ldots, c_{|\mathcal{C}|}\}$^6 with probabilities $p_C = \{p(c_1), p(c_2), \ldots, p(c_{|\mathcal{C}|})\}$.

The Document builder measures the degree of contribution^7 of each word in the sequence generated by the Category encoder and establishes a $|\mathcal{T}|$-dimensional vector or codeword with all the obtained weights. This process is commonly known in ATC as the indexing of the document. The final vectorial representation of the document constitutes the vector signal that is transmitted over the channel.

The miscouplings between the ideally generated docu-

ments and the actual documents (i.e., the introduction of lexicon borrowed from the vocabulary of other categories, word iterations, etc.) are modeled by undesirable effects intro-

duced by the channel, namely Noise, Intersymbol interference,

and Channel distortion. Noise refers to random and unpre-

dictable variations superimposed on the transmitted signal.

Channel distortion is a perturbation of the signal due to

the distorting response of the channel. And, finally, the

Intersymbol interference is a form of distortion caused by

the previously transmitted symbols.

The received signal d is the actual document that we

manipulate. The role of the receiver/decoder is to decode the

category out of the received document d (i.e., to perform the

document classification).

3.2 The Receiver/Decoder Model

Now, the problem is that, typically in ATC, the alphabet of symbols $\mathcal{T}$ has an extremely high dimensionality. Many words are semantically equivalent or category-related (i.e., the alphabets $\mathcal{T}_i$ are extensive), and a large number of others are not discriminative of any category in particular (i.e., the intersections between the sets $\mathcal{T}_i$ are large); see Fig. 3 for a visual perception of this. The alphabet of symbols $\mathcal{T}$ may be said to carry a high degree of redundancy and noise.

From a communication perspective, working with such a redundant and noisy alphabet of symbols $\mathcal{T}$ implies a suboptimal encoding of the category $C$ that generates over-dimensioned codewords. Apart from representing a waste of resources in terms of channel capacity^8 and processing economy, the over-dimensionality problem is not insignificant, since it affects the Category decisor task. From this point of view, we may wish to eliminate noise and redundancy by filtering and compressing (ideally, under a lossless compression) the alphabet of symbols $\mathcal{T}$.

This is what the blocks Prefilter, Noisy terms filter, and Redundant terms compressor in Fig. 4 basically aim at. More

precisely, the Prefilter is a low-level filter which typically

includes removal of stopwords (i.e., articles, conjunctions,

etc.), infrequent words, nonalphabetical words, etc. The

Noisy terms filter eliminates words that are noninformative

(i.e., nondiscriminative) of the category variable. And,

finally, the Redundant term compressor clusters terms that

convey similar information over the category.

The resulting alphabet of symbols $\mathcal{T}''$ is a lower dimensional set of less noisy and less redundant features (i.e., combinations of terms) that provides a more optimal sampling space for documents. In fact, the space of symbols is transformed so that the final documents, seen as codewords, are characterized by 1) being as short as possible, 2) carrying as little noise as possible, and 3) containing as much information as possible.

Back to Fig. 2: the Document filter applies the prefiltering and noisy terms filtering just described to the received document $d$. The document is thus finally represented in a lower dimensional space $\mathcal{T}'$, resulting in $d'$.

The Document sampler projects the filtered document $d'$ into the new space of features $\mathcal{T}''$, producing the document representation $d''$. This projection process implies a new document quantization. In the case of TF indexing, quantization simply means adding up the weights of the original words of a same cluster.

The category decisor has the task of decoding the

document. It is the actual supervised classifier that has

previously undergone a learning phase that, for simplicity

reasons, is not reflected in Fig. 2.

3.3 Further Remarks

Several interesting issues can be drawn from the commu-

nication analysis performed upon ATC. The first one is that

the document may be seen as a vector signal that encodes the category source information. The signal space is conformed by the alphabet of terms $\mathcal{T}$ defined by the document collection. The main issue of this encoding scheme is its high suboptimality, both because of the high


Fig. 3. Venn diagram of the alphabets of terms of different categories.

Fig. 4. Supervised term alphabet $\mathcal{T}$ reduction process.

6. $|\mathcal{C}|$ denotes the size or cardinality of $\mathcal{C}$.
7. Classically, the contribution of each word can be measured by either a binary, a term frequency (TF), or a term frequency-inverse document frequency (TFIDF) weighting scheme, among others [1].
8. The concept of channel capacity can be assimilated to the concept of storage capacity.

dimensionality of the signal space ($|\mathcal{T}| \gg |\mathcal{C}|$) and because of the fact (directly related to the latter) that a same category may be encoded by many different codewords.
category may be encoded by many different codewords.

Our work directly tackles the document-space optimi-

zation. Inspired by the communication simile established,

we ideally pursue the goal of extracting (up to the extent it

may be practically possible) an orthonormal basis for the

signal space. Under this inspiration, we have designed a

term alphabet transform (i.e., filtering and compression)

that improves the optimality of the category encoding

conveyed by the resampled documents. This should result

in a better utilization of storage and processing resources

as well as, hopefully, facilitate the document decoding or

classification task.

Note that the framework of the problem, as it has been

initially established, is not to perform a coding design and

then try to adapt the document representations to it. Instead,

we have opted to optimize, to the extent it may be possible,

the signal space in the hope of improving classification.

The document resampling model is further developed

in Section 4, while Section 5 deepens the document

decoding aspects.

4 DOCUMENT SAMPLING

4.1 Document-Space Analysis

In their vectorial representation, documents are represented

in the space of terms, and thus, following the communica-

tion simile established in Section 3, terms have to be

thought as being the basis functions of the document space.

In a nutshell, the question in the document-space analysis is how to represent terms by means of a function. What information do we have about terms? How could we characterize them?

The answer to these questions resides in the fact that in a

supervised scheme such as ATC, we are given a set of

prelabeled documents. Based on the latter, terms can be

expressed in terms of the information they convey on

categories. Once terms have been properly characterized by

a function, the notion of the orthogonality and redundancy

between them (basis functions of the document space)

can be pursued.

4.2 Distributional Representation of Terms

As previously expressed in Section 3, the generation of a

text document may be modeled as a random selection of

terms dependent on category C. In other words, the

probability of appearance of a term in a document depends

on the category the document belongs to. Terms can be

understood as the outcomes of a r.e. T that is mutually

dependent with the category r.e. C. The term is the

observable data while the category is the unknown parameter.

Thereby, any term $t_j$ can be characterized as a distributional function $f_{t_j}$ over the space of categories $\mathcal{C}$:

$$f_{t_j}: \mathcal{C} \to [0, 1], \qquad c \mapsto f_{t_j}(c). \qquad (1)$$

An intuitive alternative (but not the only one) for this distributional function is the conditional probability mass function (PMF) $p_{C|t_j}$ [4]. Note that the conditional probabilities $p_{C|t_j}(c_i \mid t_j)$ are not known, but in a supervised scheme such as ATC, they can be approximated from the training set of documents $\mathcal{D}^{train}$:

$$p(c_i \mid t_j) \approx \frac{\#(t_j, \mathcal{D}^{train}_i)}{\#(t_j, \mathcal{D}^{train})}, \qquad (2)$$

where $\#(t_j, \mathcal{D}^{train}_i)$ is the number of times the term $t_j$ appears in all training documents belonging to category $c_i$ and $\#(t_j, \mathcal{D}^{train})$ is the number of times the term $t_j$ appears in the whole training collection.
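The estimate in (2) reduces to simple counting over the labeled training set. A minimal sketch, with illustrative names (whitespace tokenization assumed):

```python
from collections import Counter, defaultdict

def distributional_representations(docs, labels):
    """Approximate f_t = p(C|t) for every term t via Eq. (2):
    p(c_i|t_j) ~ #(t_j, D_i^train) / #(t_j, D^train)."""
    per_class = defaultdict(Counter)   # #(t_j, D_i^train)
    total = Counter()                  # #(t_j, D^train)
    categories = sorted(set(labels))
    for doc, c in zip(docs, labels):
        terms = doc.lower().split()
        per_class[c].update(terms)
        total.update(terms)
    f = {t: [per_class[c][t] / total[t] for c in categories] for t in total}
    return f, categories
```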

4.3 Alphabet of Symbols $\mathcal{T}$ Filtering and Compression

As announced in Section 3, we can envisage reducing the symbol alphabet $\mathcal{T}$ in two distinct directions:

1. Noisy terms filtering—Terms that have a flat function $f_{t_j}$, that is,

$$f_{t_j}(c_i) \approx f_{t_j}(c_k), \quad \forall i, k, \qquad (3)$$

do not convey information on the target category. These terms, from a communication perspective, are noisy and should be eliminated. A noise filter should discriminate between informative terms and noninformative or noisy terms. A dispersion measure needs to be defined, as well as a selection threshold upon it. Note that the threshold will have to be set experimentally.

2. Redundant terms compression—Terms that convey similar information on the target category random event, that is,

$$f_{t_i}(c_n) \approx f_{t_j}(c_n), \quad \forall n, \qquad (4)$$

are redundant. An (ideally) lossless data compression scheme reduces redundancy by clustering terms with similar distributional representations. A similarity measure needs to be defined.

4.4 Dispersion and Similarity Measures

As just seen, to perform the document compression, we

need to establish measures on the information conveyed by

the term and the redundancy between terms. But which

metrics to use? The answer is not straightforward. We may

use distinct dispersion and similarity measures depending

on different interpretations of what the distributional term representation $f_{t_j}$ is.

4.4.1 The PMF Interpretation

Distributional functions $f_{t_j}$ are commonly PMFs. Information

Theory (IT) [11] provides useful measures to quantify

information conveyed by random events (e.g., entropy)

and similarity between PMFs (e.g., Kullback-Leibler and

Jensen-Shannon divergences).

The IT approach has been commonly adopted by other

works on Distributional Clustering [4], [5], [6]. We have

chosen to follow a new and unexplored direction, the

discrete signal interpretation, which is rooted in Communica-

tion and Signal Processing related concepts, coherently with

the proposed general framework setup.


4.4.2 The Discrete Signal Interpretation

A discrete signal is a set of $|\mathcal{C}|$ measurements of an unknown but latent random variable (r.v.). The rationale behind a discrete signal interpretation of $f_{t_j}$ is that we are interested in analyzing the general “shape” of the distributions. By modeling those distributions with a latent random variable, small differences between distributions are assimilated by the random nature of the signal.

. A dispersion measure: Sample variance. The variance is a measure of the statistical dispersion of an r.v. For a given discrete r.v. $X$ with PMF $p_X$ defined on $\mathcal{X}$, it is expressed as

$$\sigma_X^2 \triangleq E\big[(X - \mu_X)^2\big] = E[X^2] - \mu_X^2, \qquad (5)$$

where $E[g(X)] = \sum_{x \in \mathcal{X}} g(x)\, p_X(x)$ denotes the expectation operation and $\mu_X = E[X]$ the expected mean of $X$.

Now, a discrete signal is a set of measurements on an r.v. The underlying PMF is unknown, and thus, the expectation operator $E$ cannot be computed. The variance of a discrete signal $f_{t_j}$, also called the sample variance, is thus obtained by substituting the expectation operator in (5) by the arithmetic mean, as follows:^9

$$s^2_{f_{t_j}} \triangleq \frac{1}{|\mathcal{C}|} \sum_{c_i \in \mathcal{C}} \big(f_{t_j}(c_i) - m_{f_{t_j}}\big)^2, \qquad (6)$$

where $m_{f_{t_j}} = \frac{1}{|\mathcal{C}|} \sum_{c_i \in \mathcal{C}} f_{t_j}(c_i)$ is the arithmetic mean of $f_{t_j}$.

The sample variance is bounded in the interval $\big[0, \frac{|\mathcal{C}|-1}{|\mathcal{C}|^2}\big]$. Noninformative terms are those with low dispersion among categories (i.e., $f_{t_j}$ with a flat distribution). They are thus characterized by their low variance.

. A similarity measure: Sample correlation coefficient. Correlation refers to the departure of two variables from independence. Pearson's correlation coefficient in (7) is the most widely used measure of the relationship between two random variables $X$ and $Y$. It evaluates the degree to which both functions are linearly associated (it equals 0 if they are statistically independent and, at the other extreme, $\pm 1$ if they are linearly dependent):

$$\rho_{X,Y} \triangleq \frac{1}{\sigma_X \sigma_Y}\, E\big[(X - \mu_X)(Y - \mu_Y)\big]. \qquad (7)$$

As with the variance, in the case of two discrete signals $f_{t_j}$ and $f_{t_k}$, the correlation coefficient is expressed by its sample version:

$$r_{f_{t_j} f_{t_k}} \triangleq \frac{\sum_{i=1}^{|\mathcal{C}|} \big(f_{t_j}(c_i) - m_{f_{t_j}}\big)\big(f_{t_k}(c_i) - m_{f_{t_k}}\big)}{|\mathcal{C}|\; s_{f_{t_j}}\, s_{f_{t_k}}}. \qquad (8)$$
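Both measures reduce to a few lines of code over a term's $|\mathcal{C}|$-dimensional signal. A minimal sketch of (6) and (8), using the biased estimators adopted in the paper (see footnote 9):

```python
def sample_variance(f):
    """Biased sample variance of a term signal f = [f(c_1), ..., f(c_|C|)], Eq. (6)."""
    m = sum(f) / len(f)
    return sum((x - m) ** 2 for x in f) / len(f)

def sample_correlation(f, g):
    """Sample correlation coefficient between two term signals, Eq. (8)."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    cov = sum((x - mf) * (y - mg) for x, y in zip(f, g)) / n
    return cov / (sample_variance(f) ** 0.5 * sample_variance(g) ** 0.5)

# A flat (noisy) term has zero variance; two redundant terms correlate near +1.
flat = [0.25, 0.25, 0.25, 0.25]
t1, t2 = [0.70, 0.10, 0.10, 0.10], [0.60, 0.15, 0.15, 0.10]
print(sample_variance(flat))        # 0.0
print(sample_correlation(t1, t2))   # ~0.99
```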

4.5 Clustering Algorithms

Now let us turn to the clustering algorithms used for the

redundancy compression task. Our main approach in this

research has been to adopt an agglomerative term cluster-

ing approach, disregarding efficiency aspects apparently

improved by divisive clustering methods as pointed out by

Dhillon [6]. The reason is that our interest has been focused on studying the influence of the number of clusters built and its optimal value, rather than on algorithm efficiency aspects.

4.5.1 Initial Agglomerative Approach

The first clustering implementation has been inspired by the agglomerative hard clustering^10 algorithm proposed by Baker [4]. The algorithm is simple and scales well to large vocabulary sizes since, instead of comparing the similarity of all pairs of words, it restricts the comparison to a smaller subset of size $k$ ($k$ being the final number of clusters desired). After the data set preprocessing and noisy terms filtering, the word vocabulary is ordered in decreasing variance order. Then, the algorithm initializes the $k$ clusters to the $k$ first words of the sorted list. It follows on by iteratively comparing the $k$ clusters and merging the closest ones. Empty clusters are filled with the next words in the sorted list.

When merging occurs, the distribution of the new cluster becomes the weighted average of the distributions of its constituent words. For instance, when merging terms $t_j$ and $t_k$ into a same cluster, the resulting distribution function is:

$$f_{t_j \vee t_k}(c) = \frac{p(t_j)}{p(t_j) + p(t_k)}\, f_{t_j}(c) + \frac{p(t_k)}{p(t_j) + p(t_k)}\, f_{t_k}(c). \qquad (9)$$

This algorithm has been named Static window Hard clustering (SH clustering). Static window refers to the fixed $k$-dimensional window it is based on, while Hard clustering denotes the nonoverlapping nature of the clustering.
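The following sketch captures the SH clustering loop just described: initialize $k$ clusters from the variance-sorted list, repeatedly merge the closest pair under the chosen similarity measure using the weighted average of (9), and refill the freed slot. It is a didactic reconstruction under simplifying assumptions (quadratic pair search, illustrative data structures such as the `sample_variance`/`sample_correlation` sketches above), not the exact published implementation.

```python
def sh_clustering(terms, f, prior, variance, similarity, k):
    """Static window Hard (SH) clustering sketch.
    terms: noise-filtered vocabulary; f[t]: distributional signal of t;
    prior[t]: p(t); variance/similarity: the Section 4.4 measures; k >= 2."""
    queue = sorted(terms, key=lambda t: variance(f[t]), reverse=True)
    # Each cluster is (member terms, merged distribution, accumulated prior mass).
    clusters = [([t], f[t], prior[t]) for t in queue[:k]]
    queue = queue[k:]
    while queue:
        # Find the closest pair among the k current clusters...
        a, b = max(((i, j) for i in range(k) for j in range(i + 1, k)),
                   key=lambda ij: similarity(clusters[ij[0]][1], clusters[ij[1]][1]))
        ta, fa, pa = clusters[a]
        tb, fb, pb = clusters[b]
        # ...merge them with the prior-weighted average of Eq. (9)...
        merged = [(pa * x + pb * y) / (pa + pb) for x, y in zip(fa, fb)]
        clusters[a] = (ta + tb, merged, pa + pb)
        # ...and fill the freed slot with the next word in the sorted list.
        t = queue.pop(0)
        clusters[b] = ([t], f[t], prior[t])
    return clusters
```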

4.5.2 Dynamic Window Approach

A further agglomerative clustering algorithm has been implemented where the fixed $k$-dimensional Static window has been replaced by a Dynamic window scheme. The rationale of this procedure is to avoid forcing the merging of distant clusters due to the fixed $k$-dimensional size of the working window, especially when $k$ is low.

The algorithm proceeds as the former Hard clustering procedure, except that the initial window size is set to an input value $W \neq k$. The window is iteratively expanded whenever no pair of clusters with intercluster distance lower than a certain threshold exists. In a subsequent step, when all vocabulary terms have been assigned to a cluster, the window is progressively contracted until its dimension reaches the number $k$ of desired clusters. Toward this objective, the intercluster distance threshold is progressively incremented following an arithmetic progression (whose common difference has to be set as an input parameter). At each step, the merging of “close” clusters is performed.


9. In this document, biased estimates are adopted. Alternatively, unbiased estimates could be used by substituting $|\mathcal{C}|$ by $|\mathcal{C}| - 1$.
10. Hard clustering assumes that each term can only belong to one cluster. Clusters do not overlap; they produce a partition (disjoint subsets) of terms.

4.5.3 Soft Clustering Approach

A soft clustering^11 approach has been designed in order to

accept different semantic contexts for a same term. The

implementation of a soft clustering model is notably more

computationally expensive than a hard clustering scheme,

since it demands an iterative procedure where the degree of

proximity of each pair of clusters is analyzed.

4.5.4 Clustering Algorithms Implemented

From the combination of the different approaches exposed,

four agglomerative clustering algorithms have been im-

plemented, namely: Static Window Hard clustering (SH),

Static Window Soft clustering (SS), Dynamic Window Hard

clustering (DH), and Dynamic Window Soft clustering (DS).

4.6 Document Quantization

Once the clustering algorithm has ended, we assume each of the resulting term clusters to be a symbol of the new alphabet $\mathcal{T}''$ (i.e., an indexing dimension for the document sampling). The indexing or quantization of the document can be done simply by a term frequency (TF) weighting scheme such as:

$$d_{ji} = TF(j, i) = \#(w_j, d_i), \qquad (10)$$

where $\#(w_j, d_i)$ is the number of times the terms of cluster $w_j$ appear in $d_i$. In this simple indexing, the classical inverse document frequency (IDF) factor is ignored because it has already been, to a certain extent, taken into account in the noise filtering phase (i.e., terms that appear uniformly in documents of all categories have been identified as noisy terms and, thus, eliminated).

We may ignore document length as we assume it is

independent from the category. In order to normalize all

resulting document vectors, whenever necessary, we have

adopted a cosine normalization.
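A minimal sketch of this quantization step, assuming the cluster structure produced by the SH sketch of Section 4.5.1 and including the cosine normalization just mentioned as an option:

```python
import math

def quantize(doc_terms, clusters, normalize=True):
    """Resample a filtered document into the cluster space, Eq. (10):
    each component adds up the occurrences of the terms of one cluster."""
    term_to_cluster = {t: idx
                       for idx, (members, _, _) in enumerate(clusters)
                       for t in members}
    vector = [0.0] * len(clusters)
    for term in doc_terms:
        if term in term_to_cluster:        # filtered (noisy) terms are simply dropped
            vector[term_to_cluster[term]] += 1.0
    if normalize:                          # optional cosine (unit-length) normalization
        norm = math.sqrt(sum(x * x for x in vector))
        if norm:
            vector = [x / norm for x in vector]
    return vector
```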

5 DOCUMENT DECODING

We cannot expect that the final alphabet of symbols

obtained exactly maps a pure orthogonal basis set, as eventually desired. Consequently, documents re-

sampled in the new term-clusters space can be assumed

to be (to a certain extent) corrupted codewords of the

ideal category encoding. Adopting communication termi-

nology, we may say they constitute the actually received

messages, contaminated by the undesirable effects intro-

duced by the transmission channel. To sum up, the

decoding of a document sampled in the term-clusters

space, which ideally would be a straightforward extraction of the category, cannot as such be directly implemented in practice, due to the influence of channel noise, interference, and distortion. This brings us to an Optimum Detection problem.

5.1 MAP Decoder

The optimization criterion can be formulated in terms of $p(c_i \mid d_k)$, that is, the conditional probability that $c_i$ was selected by the source given that the document $d_k$^12 is received. If

$$p(c_i \mid d_k, H) \geq p(c_j \mid d_k, H) \quad \forall j \neq i \qquad (11)$$

(where $H$ denotes the overall hypothesis space), then the decoder should decide that the transmitted symbol was the category $c_i$. This constitutes the basis of a maximum a posteriori (MAP) or probabilistic decoder, which is expressed as:

$$\hat{c}(d_k) = \arg\max_{c_i}\, p(c_i \mid d_k, H). \qquad (12)$$

Now, the posterior probability $p(c_i \mid d_k, H)$ can be straightforwardly estimated by Bayesian inference: $p(c_i \mid d_k, H) = \frac{p(d_k \mid c_i, H)\, p(c_i \mid H)}{p(d_k \mid H)}$. Given that the evidence $p(d_k \mid H)$ does not depend on the category $c_i$, the classification criterion simplifies to the following expression, which constitutes the discriminative function of MAP categorizers:

$$\hat{c}_{MAP}(d_k) = \arg\max_{c_i}\, p(d_k \mid c_i, H)\, p(c_i \mid H). \qquad (13)$$

5.2 Gaussian Assumption (Discriminant Analysis)

The Gaussian assumption is a classical modeling assumption

heavily used in areas such as Signal Processing and Communication Systems but seldom applied in the field of ATC (see Section 5.2.3 for a discussion of this assertion).

The Gaussian model assumes that each category encoding characterizes a multivariate Gaussian or Normal Probability Density Function (PDF). A document $d$ is then assumed to be a realization of an $n$-dimensional random vector $\boldsymbol{D}$ that is dependent on the category output $c_i$ with the following Gaussian PDF $N(\mu_i, \Sigma_i)$:

$$f_{N(\mu_i, \Sigma_i)}(d) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_i|^{1/2}}\; e^{-\frac{1}{2}(d - \mu_i)^T \Sigma_i^{-1}(d - \mu_i)}, \qquad (14)$$

where the mean vector $\mu_i = E[\boldsymbol{D} \mid c_i]$ is an $n$-dimensional vector, and the covariance matrix $\Sigma_i = E[(\boldsymbol{D} - \mu_i)(\boldsymbol{D} - \mu_i)^T \mid c_i]$ is an $n \times n$-dimensional positive-definite^13 matrix with positive determinant $|\Sigma_i|$.

Now, the likelihood $p(d_k \mid c_i)$ can be expressed in the following terms, where $F_{N(\mu_i, \Sigma_i)}$ denotes the probability distribution function:^14

$$p(d_k \mid c_i) = \lim_{\Delta \to 0} p(d_k \leq \boldsymbol{D} \leq d_k + \Delta \mid c_i) = \lim_{\Delta \to 0} \big[F_{N(\mu_i, \Sigma_i)}(d_k + \Delta) - F_{N(\mu_i, \Sigma_i)}(d_k)\big] \approx f_{N(\mu_i, \Sigma_i)}(d_k) \cdot \Delta, \quad \Delta \to 0. \qquad (15)$$

The factor $\Delta$ appears in the numerator of (15) for each category. Consequently, it does not affect classification, and thus, the MAP criterion translates into the following discriminative function:


11. Soft or fuzzy clustering allows a term to belong to more than one cluster. Clusters may then overlap.
12. Note that $d_k$ corresponds to the document signal vector after the filtering and compression processes described in Section 3 (and thus, properly corresponds to $d''_k$).
13. Recall that a positive-definite matrix is a symmetric matrix with all its eigenvalues positive. A positive-definite matrix is always invertible or nonsingular. Its determinant is always positive.
14. By definition, the probability distribution function $F_{\boldsymbol{D}}$ and the density $f_{\boldsymbol{D}}$ are related by $f_{\boldsymbol{D}}(x) \triangleq \frac{\partial^n F_{\boldsymbol{D}}(x)}{\partial x_1 \cdots \partial x_n}$.

$$\hat{c}(d_k) = \arg\max_{c_i} \left\{ \frac{1}{|\Sigma_i|^{1/2}}\; e^{-\frac{1}{2}(d_k - \mu_i)^T \Sigma_i^{-1}(d_k - \mu_i)}\; p(c_i \mid H) \right\}. \qquad (16)$$

Given that the logarithmic function is a monotonically increasing function, (16) is normally expressed as:

$$\hat{c}(d_k) = \arg\max_{c_i} \left\{ \ln p(c_i \mid H) - \frac{1}{2} \ln|\Sigma_i| - \frac{1}{2}(d_k - \mu_i)^T \Sigma_i^{-1}(d_k - \mu_i) \right\}. \qquad (17)$$
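In code, the log-domain discriminant (17) can be evaluated directly from estimated category parameters. A minimal NumPy sketch with illustrative names, numerical safeguards aside:

```python
import numpy as np

def gaussian_map_decode(d, priors, means, covs):
    """Decode a document vector d with the Gaussian log-discriminant of Eq. (17).
    priors[i] = p(c_i|H); means[i], covs[i] = estimated category parameters."""
    scores = []
    for p, mu, sigma in zip(priors, means, covs):
        diff = d - mu
        scores.append(np.log(p)
                      - 0.5 * np.linalg.slogdet(sigma)[1]           # -(1/2) ln|Sigma_i|
                      - 0.5 * diff @ np.linalg.solve(sigma, diff))  # Mahalanobis term
    return int(np.argmax(scores))
```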

5.2.1 Quadratic Discriminant Analysis (QDA)

ATC being a supervised classification scheme, both $\mu_i$ and $\Sigma_i$ can be estimated from the training set of documents $\mathcal{D}^{train}_i$ that belong to category $c_i$. The discriminative functions in (16) and (17) describe a quadratic shape for each category, and the decision frontiers are also quadratic.

5.2.2 Linear Discriminant Analysis (LDA)

Let us suppose that the covariance matrices of all categories are identical. This constitutes the homoscedastic simplifying assumption. The discriminant function (17) simplifies to:

$$\hat{c}(d_k) = \arg\max_{c_i} \left\{ \ln p(c_i \mid H) - \frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \mu_i^T \Sigma^{-1} d_k \right\}. \qquad (18)$$

This corresponds to a linear separating surface (i.e., a hyperplane).

5.2.3 Applying Discriminant Analysis to ATC

The first problem that appears when trying to apply discriminant analysis (DA) to ATC is the so-called $n \gg N$ problem.^15 The number of variables (indexing terms or symbols) is typically extremely high (tens or hundreds of thousands), while the number of sample documents is moderately small. Moreover, note that even if this problem did not exist, the computation of extremely high-dimensional covariance matrices is not feasible.

As a result of these limitations, DA can only be envisaged after a preliminary indexing term space reduction phase.

Among the pioneers in using DA in ATC were Schutze et al. [12], who applied an LDA classifier to the routing task in 1995. More recently, in 2006, Li et al. [13] used discriminant analysis for multiclass classification. Their experimental investigation showed that LDA reaches accurate performance comparable to that offered by SVM, with a neat improvement in terms of simplicity and time efficiency, both in the learning and the classification phases.^16

5.3 Independence Assumption

The independence or Naive Bayes assumption, which states that terms are stochastically independent, can be formulated over the Gaussian MAP categorizer. Under this statement, the covariance matrix $\Sigma$ is the diagonal matrix of variances:

$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}. \qquad (19)$$

The covariance matrix determinant is simply $|\Sigma| = \sigma_1^2 \cdots \sigma_n^2$, and the inverse covariance matrix is straightforwardly deduced. The estimation of $\Sigma$ is, computationally speaking, drastically simplified, since it reduces to the computation of the $n$ variances.

It can be easily shown that, under the independence

assumption, the discriminative function (16) is simplified to

the product of the univariate Gaussian PDFs in each of the

document indexing directions:

$$\hat{c}(d_k) = \arg\max_{c_i} \Big\{ f_{N(\mu_{i1}, \sigma_{i1}^2)}(d_{k1}) \cdots f_{N(\mu_{in}, \sigma_{in}^2)}(d_{kn}) \cdot p(c_i \mid H) \Big\}. \qquad (20)$$

5.3.1 Quadratic GNB

The quadratic GNB obeys (20). The category-dependent parameters can be estimated by the arithmetic mean, $\mu_{ij} \approx m_{ij} = \frac{1}{N_i} \sum_{d_k \in \mathcal{D}^{train}_i} d_{kj}$, and by the sample variance, $\sigma_{ij}^2 \approx s_{ij}^2 = \frac{1}{N_i} \sum_{d_k \in \mathcal{D}^{train}_i} (d_{kj} - m_{ij})^2$.

5.3.2 Linear GNB

Under the homoscedastic hypothesis, the variance is considered to be category independent. It can be estimated as before by substituting $N_i$ by $N$ and $\mathcal{D}^{train}_i$ by $\mathcal{D}^{train}$. Under this simplification, (20) defines a linear GNB, as expressed in (18).

5.3.3 White Noise GNB

A further simplifying assumption may consider the variance to be one and the same for all categories and variables. This generates what we have baptized the White Noise GNB (or simply WN-GNB)^17 categorizer. It can easily be shown that, after applying the logarithmic function, the discriminative function (20) simplifies to:

$$\hat{c}_{MAP}(d_k) = \arg\max_{c_i} \left\{ -\frac{1}{2} \sum_{j=1}^{n} \frac{(d_{kj} - \mu_{ij})^2}{\sigma^2} + \ln p(c_i \mid H) \right\}. \qquad (21)$$

In the case of equiprobable categories, the WN-GNB categorizer defined in (21) reduces to a (euclidean) minimum distance categorizer.

5.3.4 A New Hybrid GNB Categorizers Family

Document data sets are characterized by extremely sparse

matrices (even after the resampling envisaged in Section 4).

The majority of the variables (i.e., indexing terms) are not


15. When the number of dimensions $n$ of the random vector is greater than the quantity of sample data $N$, the estimated covariance matrix does not have full rank, and thus, cannot be inverted.
16. The advantage of adopting an LDA strategy is that it is a natural multiclass classifier, while SVM is a binary classifier that has to be adapted to the multiclass problem, by reducing it to a binary classification scenario, which is a nontrivial task.
17. In Communications, white noise is a Gaussian-distributed noise that has a flat spectral density. It is called white noise by analogy to white light. By similitude, we use the terminology white noise to designate a Gaussian noise that affects all document vectors uniformly.

representative of the target category. This means that, for a large part of the indexing attributes, the mean and variance are theoretically null.^18 This fact negatively affects the computation of the discriminative function in (20), since for those attributes a small deviation from zero of an indexing weight $d_{kj}$ results in a close-to-zero value for the factor $f_{N(\mu_{ij}, \sigma_{ij}^2)}(d_{kj})$, which (when accumulated repeatedly) can end up setting the overall probability to nil.

With the idea of mitigating the effects of the sparsity existing in ATC, we have envisaged setting a variance lower bound.
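A minimal NumPy sketch of a quadratic GNB with such a variance lower bound follows; the floor value shown is illustrative, since the paper sets this bound experimentally.

```python
import numpy as np

def fit_gnb(X, y, var_floor=1e-4):
    """Quadratic GNB estimation (Section 5.3.1) with the variance lower bound
    of Section 5.3.4. X: (N, n) array of resampled documents; y: labels.
    var_floor is an illustrative value; the bound is tuned experimentally."""
    y = np.asarray(y)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    variances = np.maximum(variances, var_floor)  # mitigate sparsity-driven zeros
    return classes, priors, means, variances

def predict_gnb(X, classes, priors, means, variances):
    """Log-domain evaluation of Eq. (20): sum of univariate Gaussian log-PDFs."""
    log_scores = (np.log(priors)
                  - 0.5 * np.log(2 * np.pi * variances).sum(axis=1)
                  - 0.5 * (((X[:, None, :] - means) ** 2) / variances).sum(axis=2))
    return classes[np.argmax(log_scores, axis=1)]
```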

6 EXPERIMENTAL SCENARIO

Before entering into this section, note that some aspects of the experimental scenario adopted, such as the tackling of the overlapping issue in Reuters-21578 and the effectiveness measure used, are reviewed in some detail, since they constitute a different treatment from commonly followed solutions [1].

6.1 Standard Collections

Two of the most widely used standard collections, the

20 Newsgroups (NG) and the Reuters-21578 (RE) data sets,

form our experimental scenario. In our experiments, both

collections have been preprocessed by removing stopwords

(from the Weka [14] stop list) and nonalphabetical words.

Infrequent terms, occurring in fewer than four documents or appearing fewer than four times in the whole data set, have also been filtered.

6.1.1 20 Newsgroups

The 20 Newsgroups data set is a collection of approximately

20,000 newsgroup documents, partitioned (nearly) evenly

across 20 different newsgroups, each corresponding to a

different topic. We used the bydate version of the data set

maintained by Jason Rennie,^19 which is sorted by date into

training (60 percent) and test (40 percent) sets.

6.1.2 Reuters-21578

The “Reuters-21578, Distribution 1.0” test collection^20 is formed from the 21,578 news stories that appeared on the Reuters

newswire in 1987, classified (by human indexers) according

to 135 thematic categories, mostly concerning business and

economy. In our experiments, we have used the most

common subset of Reuters-21578, the ModApté training/test

partition which only considers the set of 90 categories with

at least one positive training example and one positive test

example. It results in a partition of 7,769 training documents

and 3,019 test documents.

Several factors characterize Reuters-21578 data set [15],

notably: categories are overlapping (i.e., a document may

belong to more than one category) and distribution across

categories is highly skewed (i.e., some categories have

very few labeled documents—even only one—while others

have thousands).

6.1.3 Tackling the Category Overlapping Issue

Reuters-21578 data set classification is inserted in a multilabel categorization frame, which is, by nature, outside the single-label categorization scheme assumed in most ATC research works, including this one. The category overlapping issue can be tackled in three different ways:

. By deploying the $K$ nonexclusive categories into all the possible $2^K$ category combinations.

. By assuming that categories are independent, and thus, reverting the $K$-categories classification problem into $K$ independent binary classification tasks.

. By ignoring multilabeled documents, which constitute approximately 29 percent of all documents in Reuters-21578.

Classically, the category independence alternative has been implicitly undertaken in most research [15]. A minority of authors, such as [16], have opted to ignore multilabeled documents. In our work, we have opted to deploy the $K = 90$ Reuters-21578 categories into all possible $2^K = 2^{90}$ combinations, which results in an impressive number of the order of $10^{27}$. The reasons for our decision are basically:

. Our conceptual framework, both for document sampling and decoding, is based on a multiclass (not binary) scheme.

. By deploying categories, we avoid assuming any independence hypothesis.

. 379 (out of more than $10^{27}$!) is the actual number of category combinations that have at least one document representative in the training set (out of these 379, only 126 category combinations are represented in the test set).

6.2 Effectiveness Measures

6.2.1 Confusion Matrix

In a single-label multiclass classification scheme, the

classification process can be visualized by means of a

confusion matrix $\Psi$. Each column of the matrix represents the

number of documents in a predicted category, while each

row represents the documents in an actual category.

In other words, referring to Table 1, $\psi_{11}$ would be the number of documents of category $c_1$ that are correctly classified under $c_1$, while $\psi_{12}$ corresponds to documents of


TABLE 1
An Example of Confusion Matrix $\Psi$

18. Note that, in practice, the variance has to be set to a minimal value; otherwise, the univariate normal PDF expressions could not be computed.
19. http://people.csail.mit.edu/jrennie/20Newsgroups/.
20. The Reuters-21578 corpus is freely available for experimentation purposes from http://www.daviddlewis.com/resources/testcollections/reuters21578.

$c_1$ incorrectly classified into $c_2$, and $\psi_{21}$ corresponds to documents of $c_2$ incorrectly classified into $c_1$.

6.2.2 Precision and Recall Microaverages

Precision and recall^21 are common measures of effectiveness in ATC [1]. Overall measures for all categories are usually obtained by microaveraging:^22

$$\pi_\mu = \frac{TP}{TP + FP} = \frac{\sum_{i=1}^{|\mathcal{C}|} TP_i}{\sum_{i=1}^{|\mathcal{C}|} (TP_i + FP_i)}, \qquad (22)$$

$$\rho_\mu = \frac{TP}{TP + FN} = \frac{\sum_{i=1}^{|\mathcal{C}|} TP_i}{\sum_{i=1}^{|\mathcal{C}|} (TP_i + FN_i)}. \qquad (23)$$

It can be easily seen that, in the single-label multiclass classification model, the former expressions are equivalent. They both result from the quotient of the sum of the diagonal components of matrix $\Psi$ and the sum of all elements of matrix $\Psi$, which happens to be the overall classification accuracy: the quotient of correct classifications (numerator) over total, correct and incorrect, classifications (denominator).
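In code, this equivalence reduces to a trace-over-sum computation on the confusion matrix. A minimal sketch with a toy matrix:

```python
import numpy as np

def micro_accuracy(psi):
    """Micro-averaged precision = recall = accuracy in the single-label
    multiclass model: trace(Psi) / sum(Psi), per Eqs. (22)-(23)."""
    psi = np.asarray(psi)
    return np.trace(psi) / psi.sum()

# Toy 3-category confusion matrix: rows = actual, columns = predicted.
print(micro_accuracy([[50, 2, 3], [4, 40, 1], [2, 2, 46]]))  # 0.9066...
```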

6.3 Technological Solutions

The implementations undertaken in this research are based

on the Weka 3 Data Mining Software [14]. In particular, we

have used: “NaiveBayesMultinomial,” which is the MNB

categorizer implemented in Weka, and “Weka LibSVM (WLSVM),” which is the integration of LibSVM into the Weka environment.^23

7 EXPERIMENTAL RESULTS ON NOISY TERMS FILTERING

This section provides empirical evidence that the novel

noisy terms filtering ensures a beneficial feature space

reduction.

7.1 20 Newsgroups

7.1.1 Experimental Scenario

In the noisy terms filtering scheme that was designed, the setting of the sample variance threshold $\tau_{s^2}$ (from this point on, simply noted $\tau$) has to be heuristically tuned. The only thing we know a priori is that it is bounded in the interval $\big[0, s^2_{max}\big] = \big[0, \frac{|\mathcal{C}|-1}{|\mathcal{C}|^2}\big]$, which results in $[0, 0.0475]$ for $|\mathcal{C}| = 20$.

We have graphically represented in Fig. 5 the effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 20, 100, and 5,000 clusters, respectively. The classifiers used have been the classic MNB

and SVM. The clustering algorithm tested is static window

hard clustering (SH clustering).^24 The similarity measure

used is the sample correlation coefficient in both cases. The

rest of clustering parameters are the default ones (see

Section 8).

7.1.2 Analysis of Results

The curves corresponding to MNB and SVM classification

in Fig. 5 have parallel evolutions. They present a “mountain

shape” with a maximum in the $\tau$ range of $[0.005, 0.02]$. Note that this range ensures a classification accuracy decrease

lower than 5 percent for the “20 clusters curve” (which

happens to be our target clustering as we will see in

Section 8). Outside this range, the decrease in classification accuracy is considered to be too high.

7.2 Reuters-21578

7.2.1 Experimental Scenario

As in the case of 20 Newsgroups, the dispersion measure used has been the sample variance. A priori, we only know that the variance threshold is bounded in the interval $\big[0, s^2_{max}\big] = \big[0, \frac{|\mathcal{C}|-1}{|\mathcal{C}|^2}\big]$, which results in $[0, 0.002631]$ for $|\mathcal{C}| = 379$.

We have graphically represented in Fig. 6 the effect of the

variation of sample variance threshold on the classification

accuracy for term clusterings of 200, 500, and 1,000 clusters,

respectively. The classifiers used have been the classic MNB

and SVM and the clustering algorithm tested SH clustering.

The similarity measure used is the sample correlation


Fig. 5. 20 Newsgroups—Effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 20, 100, and 5,000 clusters, respectively. The clustering algorithm used is the static window hard clustering version with the sample correlation coefficient as similarity measure. (a) MNB classifier. (b) SVM classifier.

21. Precision ($\pi_i$) is the probability that if a random document $d_j$ is classified under $c_i$, this decision is correct. Recall ($\rho_i$) is the probability that if a random document $d_j$ ought to be classified under $c_i$, this decision is taken.
22. $TP$ stands for True Positive, $FP$ for False Positive, $TN$ for True Negative, and $FN$ for False Negative.
23. Weka LibSVM is publicly available from http://www.cs.iastate.edu/~yasser/wlsvm/.
24. Similar results are obtained with dynamic window soft clustering (DS clustering), but, due to space limitations, they are not shown here.

coefficient in both cases. The rest of clustering parameters

are the default ones (see Section 8).

7.2.2 Analysis of Results

The curves corresponding to MNB and SVM classification have slightly different evolutions. The MNB classifier seems to be more robust to the presence of noisy terms (low values of the variance threshold). In any case, both classifiers (or, more precisely, both clustering/classifier combinations) may be considered reasonably robust to the presence of noisy terms. The optimal range for $\tau$ is $[0.00005, 0.001]$. Note that this range ensures a maximum decrease of accuracy of 5 percent for the “200 clusters curve” (our target clustering, as we will see in Section 8). In both cases, the absolute maximum is obtained for a value of $\tau$ of 0.0003.

Similar results are obtained when clustering is performed by the DS algorithm, with the difference that the optimal range for $\tau$ is restricted to $[0.0002, 0.001]$.

7.3 Interpreting and Extrapolating Results

Both in 20 Newsgroups and Reuters-21578, we have seen that a range of values exists for the sample variance threshold where a maximum of classification accuracy is reached. Outside these bounds, the classification accuracy decreases:

. When $\tau < \tau_{lower\text{-}bound}$, the selection is too unrestrictive and noisy terms affect term clustering negatively, which results in classification accuracy deterioration.

. When $\tau > \tau_{upper\text{-}bound}$, the selection is too restrictive and informative terms are eliminated, which also results in accuracy deterioration.

While in some cases (e.g., SH clustering on the Reuters collection) classification accuracy seems not to be affected by any variance threshold lower bound (i.e., robustness to noisy terms), a threshold upper bound always exists. This is reasonable, as one cannot limitlessly eliminate terms without reducing classification accuracy. The crux of the question here is to find an effective term reduction (one that speeds up the term clustering process) while preserving classification accuracy. In other words, we have the following compromise: the higher $\tau$ is, the faster the clustering process runs, but the lower the classification accuracy becomes.

The problem is how to determine a priori (not experi-

mentally) the optimal values for the variance threshold

lower and upper-bounds.

There is a clear correspondence of relative $\tau$ values in both cases. In the case of 20 Newsgroups, the optimum range for $\tau$ is $[0.1053\, s^2_{max}, 0.4211\, s^2_{max}]$, which corresponds to a range of 88-41 percent of terms selected. For Reuters, it is $[0.08\, s^2_{max}, 0.38\, s^2_{max}]$, which corresponds to a range of 94-33 percent of terms selected.

We may extrapolate these particular cases into a general

scenario (that should, of course, be verified in future

experiments with other collections). The optimal range for

$\tau$ may be set to $[0.11\, s^2_{max}, 0.38\, s^2_{max}]$ (this would embrace the limits set by both collections).
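Under this extrapolation, the candidate threshold range follows directly from the number of categories. A small sketch (the 0.11 and 0.38 factors are the extrapolated relative bounds above):

```python
def tau_bounds(num_categories, rel_low=0.11, rel_high=0.38):
    """Extrapolated noisy-terms threshold range of Section 7.3:
    tau in [0.11, 0.38] * s2_max, with s2_max = (|C| - 1) / |C|**2."""
    s2_max = (num_categories - 1) / num_categories ** 2
    return rel_low * s2_max, rel_high * s2_max

print(tau_bounds(20))    # 20 Newsgroups: ~(0.0052, 0.0181)
print(tau_bounds(379))   # Reuters-21578 deployed categories: ~(0.00029, 0.00100)
```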

8 EXPERIMENTAL RESULTS ON CLUSTERING OF TERMS

This section provides empirical evidence that our redun-

dancy compression model drastically outperforms two of

the most effective feature selection functions [17], namely,

Information Gain (IG) and Chi-square (CHI), and allows an

aggressive dimensionality reduction with minimal loss (and, in some cases, gains) in final classification accuracy.

8.1 20 Newsgroups

8.1.1 Experimental Scenario

In the preliminary term selection phase of this experimental

scenario, noisy terms have been eliminated using the

sample variance measure and a threshold set to 0.02. As a

result of this, out of the initial 32,631 preselected terms of

the training set, 13,465 informative words have been

selected and served as the basis of the clustering phase.

We applied the four clustering variants with the sample correlation coefficient as similarity function. In the dynamic window approaches, the initial window size has been set to 100, and the contraction step was empirically set to 0.0001.^25

8.1.2 Discussion of Basic Performance

We have obtained the categorization accuracy of the MNB categorizer on the NG data set indexed in the resulting cluster space. This has been compared with the results


25. Experimentally, small values for this parameter, lower than 0.0001, have shown better results than higher ones.

Fig. 6. Reuters-21578—Effect of the variation of the sample variance threshold on the classification accuracy for a term clustering of 200, 500, and

1,000 clusters, respectively. The clustering algorithm used is static window hard clustering version with the sample correlation coefficient as the

similarity measure. (a) MNB classifier. (b) SVM classifier.

issued from classic IG and CHI selection functions applied

to the raw data set with the same preprocessing (removal of stopwords and nonalphabetical words) and the same space

reduction factor.

Fig. 7 shows parallel results to the ones obtained by the

previous research works on Distributional clustering [4], [5],

[6]. The clusters categorization accuracy curves are notably

better than those of classic IG and CHI term selection

functions. They present an abrupt initial increase up to 20 clusters (accuracy in the range of 74 percent to 76 percent, maximum value 76.0489 percent), and from there, they asymptotically approach the maximum accuracy of 78.2528 percent obtained by a full-feature MNB classifier.^26

We can say that the clustering is good, with only a residual loss of classification accuracy of 2.82 percent, for 20 clusters or more, which is the number of categories defined in the 20 Newsgroups collection. It results in an indexing term-space reduction from the original 113,357 words to only 20 word clusters. The reduction factor is $\rho = \frac{113{,}357 - 20}{113{,}357} = 0.9982$.

8.1.3 NG Indexed by Clusters

Fig. 8a illustrates how the NG data set is indexed in the obtained cluster space.^27 The graphic represents the document vectors (on the y-axis) indexed in the term-cluster space (x-axis). Documents have been normalized, and they have also been arranged according to the category they belong to. The graphic thus has to be “read” following the 20 horizontal “bands” that can be identified in the figure. Each band corresponds to all documents belonging to the same category. Each single document is a line inside this band. Basically, each category is mainly identified by a single and distinctive cluster.^28

Finally, NG indexed by 100 term clusters, in Fig. 8b, gives a very graphical understanding of the power of clustering. Here again, basically 20 clusters are active, while a large part of the figure is void. This asserts the idea that increasing the number of clusters beyond the number of categories is roughly unnecessary.

8.2 Reuters-21578

8.2.1 Experimental Scenario

We applied our four clustering variants on the Reuters-

21578 preprocessed training set upon which noisy terms

have been filtered by a sample variance threshold of 0.0003

(the collection is finally indexed by 8,550 informative

terms). We have used the sample correlation coefficient. The

step in the contraction phase of the dynamic window

approaches was empirically set to 0.00001 and the window

size to 100. Finally, the whole collection of documents has

been indexed in the resulting space of clusters.

8.2.2 Discussion of Basic Performance

Fig. 9 shows results parallel to the ones obtained with the 20 Newsgroups data set. Curves corresponding to the Distributional clustering algorithms present an abrupt initial increase up to 200 clusters (accuracy of 76-82 percent, depending on the clustering and the categorizer used), and from there, they asymptotically approach a maximum accuracy of the order of the maximum of 81.5953 percent obtained by classical selection functions (IG or CHI) with SVM. Note that in similar conditions, MNB obtains a maximum accuracy of 78.453 percent.

In other words, the indexing space can be reduced from

the original 32,539 words to 200 clusters (i.e., a reduction of two orders of magnitude) with no loss (even a gain) of categorization accuracy. The reduction factor is $\rho = \frac{32{,}539 - 200}{32{,}539} = 0.9939$.

With this reduction, there is a gain of classification accuracy

of 4.53 percent (from 82.1478 percent to 78.453 percent

accuracy obtained when indexing with 10,000 terms

selected with indiscriminately IG or CHI functions) when

using the MNB classifier. When using SVM, the loss of

classification accuracy is only 2.24 percent (from79.7676 per-

cent to 81.5953 percent). Most notably, an overall maximum

accuracy of 82.1478 percent is reached when RE is indexed

by the 200 clusters obtained by DS clustering and classified

with MNB.

Qualitatively, our clustering approach essentially improves upon the deterministic annealing algorithm (a Distributional Clustering algorithm) proposed by Bekkerman et al. [7], who found that simple word indexing was a more efficient representation than word clusters for Reuters-21578 (10 largest categories).^29 The authors attribute the poor performance of their algorithm to the structural "low complexity" of the Reuters corpus (compared to 20 Newsgroups, for instance), for which indexing documents with a large number of words was not found to significantly improve classification accuracy. Our results show that classification accuracy over Reuters-21578 is clearly improved by using word clusters instead of words when indexing with a small number of

features. One of the basic differences between Bekkerman's approach and ours that may explain our results, apart from the noisy term filtering process, is that the former solves the overlapping issue in the Reuters collection by assuming the obviously violated hypothesis of category independence, while we avoid such a hypothesis (see Section 6.1.3).

26. Note that the maximum categorization accuracy values of Baker and McCallum [4] and Dhillon et al. [6] are (in absolute terms) slightly superior to ours because they used a bigger training split (a 2/3 train-1/3 test random split, against our 60 percent train-40 percent test Rennie's publicly available "bydate" split). They report [6] achieving 78.05 percent accuracy with only 50 clusters, just 4.1 percent short of the accuracy achieved by a full-feature classifier. Our results with only 20 clusters reduce this loss to 2.82 percent, thus representing comparatively better results.

27. Produced by DS clustering using the sample correlation coefficient as similarity measure and a sample variance threshold of 0.015 (i.e., 18,046 terms selected).

28. This is the general idea. Nevertheless, when analyzed in more detail, Fig. 8 shows secondary clusters for some of the categories, which indicate subject relationships.

29. To our knowledge, the research of Bekkerman et al. is the only work that has applied Distributional Clustering to the Reuters-21578 data set.

30. Issued from the decomposition of the original 90 overlapping categories.

Fig. 7. 20 Newsgroups—Classification accuracy of Distributional Clustering algorithms versus Information Gain and Chi-square term selection functions with the Multinomial Naive Bayes categorizer, using the sample correlation coefficient as similarity measure.

The 379 (nonoverlapping) categories^30 of the Reuters-21578 collection can thus be optimally indexed with only 200 clusters. It should be noted that, out of the total 379 categories of the training set, only 126 are effectively represented in the test set. This fact, together with the extremely small representation of some categories, may explain why the optimal number of clusters needed for indexing is lower than the number of categories.

8.2.3 RE Indexed by Clusters

Fig. 10 illustrates how the Reuters data set is indexed in the cluster space produced by DS clustering (the figure shows a detailed view of the most representative clusters). In order to facilitate the reading of the figure, the ten most populated categories have been labeled (horizontal grids delimit the corresponding bands). The figure clearly reflects the uneven category distribution of the data set. The clusters corresponding to the dominant categories earn and acq clearly mark two vertical lines in the figure. In general, these clusters are quite active in all documents. A possible explanation is that these clusters are so big^31 that they tend to be generalist. Another way to see it is that all categories are related to the general subject of business and economy that these clusters (identifiers of categories earn and acq) globally represent. In any case, when examined in detail, a singular and discriminative cluster can be identified for each category, except for category interest&money-fx, which shows indexing similar to that of category money-fx. Note that both of these categories are mainly indexed by a fairly big cluster.

9 EXPERIMENTAL RESULTS ON DOCUMENT DECODING

9.1 20 Newsgroups

In Fig. 11a, it can be observed that^32 hybrid Q-GNB presents good classification results, surpassing MNB and LDA in the range of 20-200 word clusters.


Fig. 8. 20 Newsgroups training set indexed by clusters. (a) Training set indexed by 20 clusters. (b) Training set indexed by 100 clusters.

Fig. 9. Reuters-21578—Categorization accuracy of Distributional Clustering versus Information Gain and Chi-square. Dispersion and similarity

measures used are sample variance and sample correlation coefficient. (a) MNB categorizer. (b) SVM categorizer.


31. Uneven cluster size has a direct correspondence with the uneven category distribution. The most populated categories (i.e., earn and acq), which together account for more than 57 percent of all documents, are related to two single clusters that together contain more than 52 percent of the total number of terms in clusters.

32. The NG data set used in these experiments is indexed in the space of the word clusters produced by DS clustering using the sample correlation coefficient as similarity measure and a sample variance threshold of 0.015 (i.e., 18,046 preselected terms), with the rest of the parameters set to default values.

These results should be interpreted from the perspective of the sparsity of the data set representation. As extensively commented in Section 8.1.3, when indexing the NG collection with 20 word clusters, each word cluster (nearly) univocally identifies a distinct category. This means that the rest of the word-cluster indexes are set (on average) to zero, with a very narrow (practically null) variance. The effect of null-mean and null-variance indexing attributes turns out to be extremely negative. For instance, a residual nonnull weight in one of those "null" indexing attributes may easily generate a zero-probability term in (20) that will eventually override all other nonnull components of the expression. The idea pursued by our hybrid GNB is to set a lower-bound value for the variance.
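A minimal sketch of this variance-flooring idea follows (our own hypothetical implementation, not the authors' code): a Gaussian Naive Bayes classifier whose per-class, per-attribute variances are clipped from below so that a near-constant attribute cannot contribute a vanishing likelihood factor:

```python
import numpy as np

class HybridGNB:
    """Gaussian Naive Bayes with a lower bound (floor) on attribute variances."""

    def __init__(self, var_floor=0.015):
        self.var_floor = var_floor

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Clip variances from below: a (nearly) null-variance attribute would
        # otherwise yield a zero-probability factor that overrides the rest.
        raw_vars = np.array([X[y == c].var(axis=0) for c in self.classes_])
        self.vars_ = np.maximum(raw_vars, self.var_floor)
        return self

    def predict(self, X):
        # log P(c) + sum_i log N(x_i; mu_{c,i}, sigma_{c,i}^2), per document.
        diff = X[:, None, :] - self.means_  # (n_docs, n_classes, n_attrs)
        log_lik = -0.5 * (diff ** 2 / self.vars_ + np.log(2 * np.pi * self.vars_))
        log_post = np.log(self.priors_) + log_lik.sum(axis=2)
        return self.classes_[np.argmax(log_post, axis=1)]
```

With var_floor close to zero this degenerates into a native GNB, which is exactly the regime shown below to fail on sparse cluster indexes.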

To get a more precise idea of what a variance lower bound of 0.015 means, we can make use of the 68-95-99.7 (or empirical) rule that characterizes a normal or Gaussian distribution. As stated by this rule, 95 percent of the values lie in the interval $\mu \pm 2\sigma$. Setting a lower-bound variance of 0.015 means that the narrowest Gaussian distribution allowed concentrates 95 percent of its values in the interval $\mu \pm 0.25$. This seems a reasonable restriction to superimpose in a normalized scheme such as the one we work on, where documents are cosine-normalized and attribute weights thus vary in the interval [0, 1].
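The step from the 0.015 floor to the $\pm 0.25$ interval is just a square root; as a quick check of the arithmetic:

```latex
\sigma_{\min} = \sqrt{0.015} \approx 0.1225,
\qquad
2\sigma_{\min} \approx 0.245 \approx 0.25 .
```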

The effect of variance lower-bound tuning on hybrid Q-GNB classification has been further studied. Optimal results were obtained for values greater than 0.01 (for which $2\sigma = 0.2$). Above this value, Q-GNB shows a slight maximum at 0.015 (for which $2\sigma \approx 0.25$). Surprisingly, classification results for a variance lower bound of 0.1 (for which $2\sigma \approx 0.63$) remain stably good. This may be explained by the robustness of the word-cluster indexing of NG.

Fig. 11. Comparison of the classification accuracy of Gaussian Naive Bayes categorizers against state-of-the-art SVM and MNB performance. (a) 20 Newsgroups—Hybrid GNB with variance lower bound of 0.015. (b) Reuters-21578—Hybrid GNB with variance lower bound of 0.008.

9.2 Reuters-21578

In Fig. 11b, it can be observed^33 that hybrid Q-GNB presents good classification accuracy, similar to that obtained by SVM and only surpassed by MNB. These results should be interpreted in the same sense as those for NG: introducing a variance lower bound contains the negative effect of the null-mean and null-variance indexes caused by the sparsity of the data set indexing matrix. The effect of variance lower-bound tuning on hybrid Q-GNB classification has also been analyzed. As with NG, small values of the variance lower bound have been seen to degrade classification accuracy, which explains the bad performance of native GNB categorizers. Optimal results are obtained for values in the range of 0.005-0.01 (for which $0.14 \le 2\sigma \le 0.2$). For higher values, hybrid Q-GNB suffers a drastic drop in accuracy: forcing the Gaussian distributions to widen too much introduces errors and confusion into the classification. This effect, which is reasonable, surprisingly did not occur with NG, possibly due to the neater separation between its categories.

10 CONCLUSIONS

The theoretical model we have proposed has led to an effective two-level term-space reduction scheme, implemented by a noisy term filtering and a subsequent redundant term compression.

In particular, the elimination of noisy terms based on the sample variance of their category-conditional PMF has experimentally proved to be an innovative and correct procedure. In both the 20 Newsgroups and Reuters-21578 collections, we have seen that there exists a range of values for the sample variance threshold t where a maximum of classification accuracy is reached. Outside these bounds, the classification accuracy decreases, due to interference from noisy terms (at the lower bound) and over-elimination of informative terms (at the upper bound).

On the other hand, our signal-processing-inspired redundancy compressors allow an indexing term-space reduction factor of $\rho = 0.9982$ (from the original 113,357 words to only 20 word clusters) with a residual loss of classification accuracy of 2.82 percent^34 with the MNB classifier.

We have extended our research to the challenging Reuters-21578 data set, a highly nonuniformly distributed text collection where categories are related and overlapping. The results obtained are extremely satisfactory and clearly outperform those previously published by Bekkerman et al. [7] with other distributional clustering procedures. Our clustering method, with the sample correlation coefficient as similarity measure, allows an indexing term-space reduction factor of $\rho = 0.9939$ (from the original 32,539 words to only 200 word clusters) with a gain in classification accuracy of 4.53 percent^35 when using the MNB classifier. When adopting SVM, the loss in classification accuracy is only 2.24 percent. In any case, the overall maximum classification accuracy is reached when the collection is indexed by 200 clusters with the MNB classifier. These results indicate that MNB significantly benefits from the compression of the term space (and the intrinsic overfitting reduction it entails). SVM is, arguably [2], more robust to overfitting and is thus less prone to be positively affected by the compression.

Fig. 10. Reuters-21578 training set indexed by 200 clusters. View restricted to the 50 most representative clusters.

33. Same experimental scenario as in Section 8.2.1.

34. With respect to an indexing of 5,000 word clusters.

35. With respect to an indexing of 10,000 terms selected with classical IG or CHI functions.

By reducing redundancy, feature extraction by term clustering tends to produce an orthogonal basis for the

document representation. Basically, each (main) category is

identified by a singular and discriminative cluster, thus bringing the compressed documents close to an orthogonal coding of the category. In all, with both collections, there seems to be a relationship between the entropy of the category distribution and the actual optimal number of clusters. 20 Newsgroups is a practically uniformly distributed collection; its normalized entropy^36 is 0.9981 (almost 1). Reuters-21578 is an extremely unevenly distributed collection; its normalized entropy is 0.4881 (almost 1/2). Experimentally, 20 Newsgroups needs as many clusters as categories, while Reuters-21578 needs half as many.

36. Normalized entropy is the ratio of the entropy to the maximum entropy.
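To make footnote 36 concrete, here is a minimal sketch (our own helper, following that definition) that computes the normalized entropy of a category histogram:

```python
import numpy as np

def normalized_entropy(category_counts):
    """Entropy of the category distribution divided by the maximum, log2(K)."""
    p = np.asarray(category_counts, dtype=float)
    k = p.size                     # total number of categories, K
    p = p / p.sum()
    nz = p[p > 0]                  # 0 * log 0 is taken as 0
    h = -(nz * np.log2(nz)).sum()  # entropy in bits
    return h / np.log2(k)          # maximum entropy is log2(K)
```

For a uniform histogram over 20 categories this returns 1.0; the 0.9981 and 0.4881 figures above are this quantity for the NG and RE category distributions, respectively.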

In ATC, MNB is one of the most popular statistical categorizers because of its simplicity and good accuracy. The Gaussian assumption [12], [13] has seen little application in ATC due to intrinsic problems related to the high dimensionality of the typical document vectorial representation. Our purpose has been, once the representation space has been optimally reduced, to experimentally test how Gaussian MAP categorizers, especially under the Naive Bayes assumption, may be adapted to the concomitant sparsity in ATC. By establishing a variance lower bound in the Gaussian PDFs, we have rescued the use of Gaussian MAP classifiers in the ATC arena. Our hybrid approach reaches classification results comparable to those obtained by MNB and SVM and opens the door to further research.

We are currently pursuing the design of a divisive clustering algorithm which, in view of the results obtained with the tested agglomerative clustering schemes, we believe can yield interesting improvements in both classification effectiveness and computational efficiency. We also envisage a thorough comparison and analysis of similarity measures. Future work is also foreseen on the communication-theoretic modeling side, with special stress on the synthesis of prototype documents via the proposed generative model, as well as on deepening the optimal design of the document coding (and subsequent decoding).

ACKNOWLEDGMENTS

The authors wish to thank the anonymous reviewers for

their fruitful comments and the Weka Machine Learning

Project for making their software open source under a GPL

license. For this research, Marta Capdevila was supported

in part by a predoctoral grant from the R&D General

Department of the Xunta de Galicia Regional Government

(Spain), awarded on 19 July 2005.

REFERENCES

[1] F. Sebastiani, “Machine Learning in Automated Text Categoriza-

tion,” ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.

[2] T. Joachims, “Text Categorization with Support Vector Machines:

Learning with Many Relevant Features,” Proc. 10th European Conf.

Machine Learning (ECML), pp. 137-142, 1998.

[3] T. Joachims, Learning to Classify Text Using Support Vector

Machines—Methods, Theory, and Algorithms. Kluwer/Springer,

2002.

[4] L.D. Baker and A.K. McCallum, “Distributional Clustering of

Words for Text Classification,” Proc. Special Interest Group on

Information Retrieval (SIGIR ’98) 21st ACM Int’l Conf. Research and

Development in Information Retrieval, pp. 96-103, 1998.

[5] N. Slonim and N. Tishby, “The Power of Word Clusters for Text

Classification,” Proc. 23rd European Colloquium on Information

Retrieval Research, 2001.

[6] I. Dhillon, S. Mallela, and R. Kumar, “A Divisive Information-

Theoretic Feature Clustering Algorithm for Text Classification,”

J. Machine Learning Research (JMLR), special issue on variable and

feature selection, vol. 3, pp. 1265-1287, 2003.

[7] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, “Distribu-

tional Word Clusters vs. Words for Text Categorization,”

J. Machine Learning Research, vol. 3, pp. 1183-1208, 2003.

[8] A. McCallum and K. Nigam, “A Comparison of Event Models for

Naive Bayes Text Classification,” Proc. Assoc. for the Advancement of

Artificial Intelligence (AAAI ’98) Workshop Learning for Text

Categorization, 1998.

[9] Y. Yang and X. Liu, “A Re-Examination of Text Categorization

Methods,” Proc. 22nd Ann. Int’l ACM Special Interest Group on

Information Retrieval Conf. (SIGIR ’99), pp. 42-49, Aug. 1999.

[10] S. Haykin, Communication Systems. John Wiley & Sons, 2001.

[11] T.M. Cover and J.A. Thomas, Elements of Information Theory,

second ed. John Wiley & Sons, Inc., 2006.

[12] H. Schutze, D. Hull, and J. Pedersen, “A Comparison of Classifiers

and Document Representations for the Routing Problem,” Proc.

18th Ann. Int’l ACM Special Interest Group on Information Retrieval

(SIGIR ’95) Conf. Research and Development in Information Retrieval,

pp. 229-237, 1995.

[13] T. Li, S. Zhu, and M. Ogihara, “Using Discriminant Analysis

for Multi-Class Classification: An Experimental Investigation,”

Knowledge and Information Systems, vol. 10, no. 4, pp. 453-472,

2006.

[14] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning

Tools and Techniques, second ed. Morgan Kaufmann, 2005.

[15] F. Debole and F. Sebastiani, “An Analysis of the Relative Hardness

of Reuters-21578 Subsets,” Proc. Fourth Int’l Conf. Language

Resources and Evaluation (LREC ’04), pp. 971-974, 2004.

[16] K. Torkkola, “Linear Discriminant Analysis in Document Classi-

fication,” Proc. IEEE Int’l Conf. Data Mining (ICDM-2001) Workshop

Text Mining (TextDM ’01), 2001.



[17] Y. Yang and J.O. Pedersen, “A Comparison Study on Feature

Selection in Text Categorization,” Proc. Int’l Conf. Machine

Learning, pp. 412-420, 1997.

Marta Capdevila received the engineering degree in telecommunications from the Polytechnic University of Catalonia (UPC), Barcelona, Spain, in 1992. She is currently working toward the PhD degree. During 1991, she studied image processing at the École Nationale Supérieure des Télécommunications, Paris, France. From 1992 to mid-1993, she was selected for a young graduate trainee contract at the European Space Agency (ESA), Frascati, Italy. After this traineeship, and until mid-1994, she held an application engineering post at the pan-European research networking company DANTE, Cambridge, United Kingdom. From 1995 to 2001, she was appointed to several positions in the Spanish industry. Since 2001, she has been involved in research activities at the Telecommunication Engineering School, University of Vigo, Spain. Her research interests include automatic text categorization and term-space compaction and compression.

Oscar W. Márquez Flórez received the telecommunication engineering degree in 1985 from the Odessa Electrotechnical Institute of Communication, Ukraine, and the doctorate degree in telecommunications in 1991 from the Ruhr-University Bochum, Germany. In 1992, he joined the Telecommunication Engineering faculty of the University of Vigo, where he is currently an associate professor. In addition to teaching, he is involved in research in the areas of signal processing in digital communications, computer-based learning, statistical pattern recognition, and image-based biometrics. He is a member of the IEEE.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.

