Mining-Based Category Evolution for Text Databases

Author: Dong Yuan-Xin
Advisors: Chih-Ping Wei, Nian-Shing Chen
Keywords: Text Categorization, Category Evolution, Category Management, Clustering
ABSTRACT

As text repositories grow in number and size and global
connectivity improves, the amount of online information in the
form of free-format text is growing extremely rapidly. In many
large organizations, huge volumes of textual information are
created and maintained, and there is a pressing need to support
efficient and effective information retrieval, filtering, and
management. Text categorization is essential to the efficient
management and retrieval of documents. Past research on text
categorization mainly focused on developing or adopting
statistical classification or inductive learning methods for
automatically discovering text categorization patterns from a
training set of manually categorized documents. However, as
documents accumulate, the pre-defined categories may not
capture the characteristics of the documents. In this study, we
proposed a mining-based category evolution (MiCE) technique
to adjust the categories based on the existing categories and
their associated documents. According to the empirical
evaluation results, the proposed technique, MiCE, was more
effective than the discovery-based category management
approach, insensitive to the quality of original categories, and
capable of improving classification accuracy.

TABLE OF CONTENTS

TABLE OF CONTENTS ................................................................................................I
LIST OF FIGURES ...................................................................................................... V
LIST OF TABLES .......................................................................................................VI
CHAPTER 1 Introduction..............................................................................................1
1.1 Background ......................................................................................................1
1.2 Research Motivation and Objective .................................................................2
1.3 Organization of the Thesis ...............................................................................5
CHAPTER 2 Literature Review....................................................................................6
2.1 Text Categorization..........................................................................................6
2.1.1 Preprocessing Step ................................................................................7
2.1.2 Representation Step............................................................................. 11
2.1.3 Induction Step .....................................................................................14
2.2 Discovery-based Category Management Approach......................................28
CHAPTER 3 Mining-based Category Evolution (MiCE) Technique..........................29
3.1 High-level Architecture of Text Categorization with Category Evolution ..........29
3.2 Algorithm of Mining-based Category Evolution (MiCE) Technique ............30
3.2.1 Category Decomposition....................................................................32
3.2.2 Category Merging ...............................................................................37
3.3 Complete MiCE Algorithm............................................................................40
CHAPTER 4 Evaluations of Mining-based Category Evolution................................41
4.1 Test Data Set ..................................................................................................41
4.2 Effectiveness of Category Evolution.............................................................42
4.2.1 Evaluation Procedure ..........................................................................43
4.2.2 Evaluation Criteria ..............................................................................43
4.2.3 Result ..................................................................................................47
4.3 Sensitivity to the Quality of Categories .........................................................51
4.4 Effect of Category Evolution on Categorization Accuracy...........................53
4.4.1 Evaluation Procedure ..........................................................................53
4.4.2 Results .................................................................................................56
CHAPTER 5 Conclusions and Future Research Directions ........................................57
REFERENCES ............................................................................................................59

LIST OF FIGURES

FIGURE 1.1: ARCHITECTURE OF LEARNING-BASED TEXT
CATEGORIZATION.............................................................................................2
FIGURE 2.1: MAIN PROCESSES OF AUTOMATIC LEARNING TEXT
CATEGORIZATION PATTERNS.........................................................................7
FIGURE 2.2: SAMPLE NEWS DOCUMENT AND FEATURE SET.......................13
FIGURE 2.3: ID3 ALGORITHM................................................................................17
FIGURE 2.4: CN2 ALGORITHM [CN89] .................................................................19
FIGURE 2.5: RULE SET FOR CATEGORY IRELAND CONSTRUCTED BY
RIPPER (ADAPTED FROM [CS99]) .................................................................21
FIGURE 2.6: THE RIPPER ALGORITHM................................................................23
FIGURE 2.7: THE SWAP-1 ALGORITHM...............................................................25
FIGURE 3.1: HIGH-LEVEL ARCHITECTURE OF TEXT CATEGORIZATION
WITH CATEGORY EVOLUTION.....................................................................30
FIGURE 3.2: PROCESS OF MICE TECHNIQUE.....................................................31
FIGURE 3.3: GRAPH FOR MERGING DECISIONS...............................................39
FIGURE 3.4: MICE ALGORITHM............................................................................40
FIGURE 4.1: SAMPLE NEWS DOCUMENT ...........................................................42
FIGURE 4.2: EXAMPLES OF CATEGORY ASSIGNMENT FOR TESTING
DOCUMENTS.....................................................................................................54
FIGURE 4.3: EXAMPLES OF CATEGORY ASSIGNMENT (AFTER CATEGORY
EVOLUTION) FOR TESTING DOCUMENTS .................................................55

LIST OF TABLES

TABLE 2.1: REPRESENTATION OF NEWS DOCUMENT SHOWN IN FIGURE
2.2.........................................................................................................................14
TABLE 3.1: EXAMPLES OF DOCUMENTS ............................................................34
TABLE 4.1: SUMMARY OF TEST DATA SET........................................................42
TABLE 4.2: EXAMPLE OF TRUE CATEGORIES AND EVOLVED CATEGORIES
..............................................................................................................................45
TABLE 4.3: EXPERIMENTAL RESULT (θs = 0.6, θm = 0.4)...................................47
TABLE 4.4: PURITY OF EVOLVED CATEGORIES ...............................................49
TABLE 4.5: DIVERSITY OF EVOLVED CATEGORIES.........................................49
TABLE 4.6: SPECIFICITY OF EVOLVED CATEGORIES ......................................49
TABLE 4.7: PERFORMANCE OF THE DISCOVERY-BASED CATEGORY
MANAGEMENT APPROACH..........................................................................50
TABLE 4.8: SENSITIVITY EXPERIMENT: PURITY OF EVOLVED
CATEGORIES .....................................................................................................52
TABLE 4.9: SENSITIVITY EXPERIMENT: DIVERSITY OF EVOLVED
CATEGORIES .....................................................................................................52
TABLE 4.10: SENSITIVITY EXPERIMENT: SPECIFICITY OF EVOLVED
CATEGORIES .....................................................................................................52
TABLE 4.11: COMPARISON OF CLASSIFICATION ACCURACY.......................56


CHAPTER 1
Introduction

1.1 Background
As text repositories grow in number and size and global connectivity improves, the
amount of online information in the form of free-format text is growing extremely
rapidly. In many large organizations, huge volumes of textual information are created
and maintained, and there is a pressing need to support efficient and effective
information retrieval, filtering, and management. Text categorization, which refers to
the assignment of textual documents to one or more pre-defined categories based on
their content [DPH98, YC94], is essential to the efficient management and retrieval of
documents. Benefits of a category hierarchy include the ability to quickly locate a
document without having to remember the exact keywords contained in that
document, and the ability to easily browse a set of related documents [ABS99]. Text
categorization can also play an important role in many applications, including
real-time sorting of email or files into folder hierarchies [DPH98], topic identification
to support topic-specific search and browsing [DPH98, ADW94], event tracking
[YPC98, YCB99] to assign news stories to pre-defined events, and text routing
to automatically classify documents so that relevant texts can be routed to appropriate
users [RH96].

Traditionally, text categorization was performed by trained professionals. The manual text
categorization process is very time-consuming and costly, thus limiting its
applicability. To overcome the inefficiency of the manual process, learning-based
text categorization has emerged, as shown in Figure 1.1. Several methods for
automatically discovering text categorization patterns, based on a training set of
manually categorized documents, have been proposed. For example, Apte et al.
[ADW94] adopted the Swap-1 rule induction method, and Cohen and Singer [CS99]
applied a rule induction method, called RIPPER [WI93], for the automatic learning of
decision rules for text categorization. Ragas and Koster [RK98] adopted the Simple
Bayesian Classifier [DH74], based on Bayesian probability, for learning text
categorization patterns. Once the text categorization patterns are discovered, new
documents can be categorized accordingly.








Figure 1.1: Architecture of Learning-based Text Categorization

1.2 Research Motivation and Objective
Past research on text categorization mainly focused on developing or adopting
statistical classification or inductive learning methods for automatically discovering
text categorization patterns from a training set of manually categorized documents.
However, existing statistical classification or inductive learning methods for text
categorization assume a pre-defined and static set of categories. That is, the set of
categories will not evolve over time. This assumption is not realistic since the
pre-defined categories may not capture the characteristics of documents. Some
categories may be too broad and thus need to be refined into several categories
of narrower scope. On the other hand, some categories may overlap too much,
so merging them into a broader category may be necessary. Let's use
two applications to illustrate the need to refine existing document categories.

Application: E-mail System
With the development of the Internet, more and more people communicate by
e-mail. Because of its convenience and immediacy, the population of e-mail
users is growing rapidly. Most e-mail users simply allow their messages to
accumulate in the Inbox, where these messages remain until they are deleted.
Faced with hundreds of e-mails every day, perusing them all is time-consuming.
Learning-based text categorization provides a solution for the management of
e-mails: it learns classification rules for each folder (category) and classifies
new e-mail into the appropriate folders (categories). It can reduce the effort
users spend arranging e-mail and filtering useless messages.

As more e-mails accumulate, the original folders may become
inappropriate. Some folders may cover several topics while others may overlap
with one another. It is essential to re-organize the folder hierarchy.

Application: News Services
With the development of the World Wide Web, more and more services are
provided on the Web; one of the most common is the news service. Text categorization
provides a natural solution for classifying news documents. As the number of
news documents increases, some news categories will grow larger and become more
complex, and the accuracy of classification may suffer. It would be time-consuming
and inefficient to manually adjust categories and re-classify existing
documents. An automatic approach is desired.

In [ABS99], Agrawal et al. applied text-mining techniques for creating, exploiting,
and maintaining a hierarchical arrangement of textual documents. Their system aims
at discovering topics embedded in existing documents, ignoring the existing categories.
We call their approach a discovery-based category management approach.
However, any adjustment to the document categories will result in the re-organization
of existing documents, which in turn will increase users' cognitive load when
browsing the existing documents through the category hierarchy. To lessen the
cognitive load resulting from adjustment of the category hierarchy, an evolution
approach rather than a discovery approach should be taken. The evolution approach
refines the existing category hierarchy by decomposing or merging categories.
Since the new categories evolve from existing ones, the change to the categories
is not as dramatic as with categories created by the discovery-based approach. Thus,
category evolution is more desirable than category discovery when a category
hierarchy already exists.

Past research on document category management focused on the discovery-based
approach. Motivated by the need for category evolution, this thesis research aims at
developing a mining-based category evolution technique. The proposed technique is
then evaluated empirically.


1.3 Organization of the Thesis
The rest of the thesis is organized as follows. In Chapter 2, the related literature will
be reviewed, including text categorization algorithms and the discovery-based
category management approach. The development of the mining-based category
evolution technique is depicted in Chapter 3, followed by the empirical evaluation
experiments in Chapter 4. Chapter 5 concludes with a summary and directions for
future work.
CHAPTER 2
Literature Review

In this chapter, we will review literature related to the thesis, including learning-based
text categorization and the discovery-based category management.

2.1 Text Categorization
As mentioned, text categorization refers to the assignment of textual documents to
one or more pre-defined categories based on their content [DPH98, YC94]. The
challenging research issue of text categorization is the development of statistical
classification or inductive learning methods for automatically discovering text
categorization patterns, based on the training set of manually categorized documents.
In general, automatically learning text categorization patterns encompasses three main
steps [ADW94], as shown in Figure 2.1: preprocessing, representation, and
induction.
1. The preprocessing step is to determine the set of features that will be used for
representing the individual documents within a collection. This is essentially the
dictionary creation process.
2. The representation step is to map each individual document into a training sample
using the dictionary generated in the previous step, and to associate it with a label
that identifies its category.
3. The induction step is to find patterns (e.g. decision trees, decision rules, etc.) that
distinguish categories from one another.














Figure 2.1: Main Processes of Automatic Learning Text Categorization Patterns

2.1.1 Preprocessing Step
The first step is to produce a set of attributes (or features), called dictionary, from the
training set of manually categorized documents, such that a document text can be
represented by a set of features in the representation step. The text portion of the
training documents is scanned to produce a list of nouns and noun phrases, excluding
those that belong to a pre-defined list of stop words or that are numbers or parts of
proper names. To reduce the number of features, words (i.e., nouns or noun phrases)
with low frequency counts are then removed from the dictionary, since features
recurring only a few times are not statistically reliable indicators [CH89, L92b].

Two possible ways to originate features [ADW94] include:
1. Universal Dictionary (from all texts in the text database):
A universal dictionary is created from the documents of all topics in the text database, and
feature selection is used to select words and phrases from this dictionary to solve a
specific text categorization problem.

2. Local Dictionary (from the relevant texts only):
Words found in documents on the given topic are entered into a local dictionary;
that is, a separate dictionary is created for each topic to represent its documents.
Features in a local dictionary are selected to suit each topic.

Feature Selection
The feature selection phase removes unnecessary words (nouns or noun phrases) for
each category. It not only reduces the load on the learning algorithms, but also reduces
bias in the raw data and improves the learning result [DPH98]. Several feature
selection methods have been proposed in the literature: frequency, TF*IDF,
correlation coefficient, the χ² metric, mutual information, and the centroid Dice
coefficient. The features chosen are the top k features with the highest feature
selection metric score.
(1) Frequency [NGL97]:
This feature selection is based on the normalized frequencies of words in the training
documents.
    (f_1(t_1), f_2(t_2), ..., f_n(t_n))

where f_i is the normalized frequency of the word t_i.
Normalized frequencies are used so that training documents of different lengths are
normalized to contribute equally during training. The features chosen are the top k
features with the highest normalized frequency.

(2) TF*IDF:
TF*IDF is defined as follows [T99]:

    TF*IDF(x_i, c) = freq(x_i, c) × IDF(x_i)

    IDF(x_i) = log( N / df_{x_i} ) + 1

where freq(x_i, c) is the frequency of the word x_i in the category c,
      df_{x_i} is the number of categories in which the word x_i appears, and
      N is the total number of categories.

The Inverse Document Frequency IDF(x_i) is the word weight of x_i for the category c;
its value is highest when the feature appears only in the category c.

(3) Correlation Coefficient:
The correlation coefficient C of a word w relevant to a particular category M is
defined as [NGL97]:

    C = sqrt(N) × ( N_{r+} N_{n-} − N_{r-} N_{n+} )
        / sqrt( (N_{r+} + N_{r-}) (N_{n+} + N_{n-}) (N_{r+} + N_{n+}) (N_{r-} + N_{n-}) )

where N_{r+} is the number of documents in the category M in which the word w occurs,
      N_{r-} is the number of documents in the category M in which the word w does not occur,
      N_{n+} is the number of documents in other categories (not M) in which the word w occurs,
      N_{n-} is the number of documents in other categories (not M) in which the word w does
      not occur, and
      N is the total number of documents.

(4) χ² metric [SHP95]:

    χ² = C²

where C is the correlation coefficient defined above.
The correlation coefficient, a variant of the χ² metric, can be viewed as a one-sided
metric. The correlation coefficient selects exactly those words that are highly indicative of
membership in a category, whereas the χ² metric will not only select from this set of
words but also those words that are indicative of non-membership in the category.

(5) Mutual Information [DPH98, L92a]:
The mutual information MI(w, M) between a feature w and a category M is defined as:

    MI(w, M) = Σ_{w ∈ {0,1}} Σ_{M ∈ {0,1}} P(w, M) log( P(w, M) / ( P(w) P(M) ) )

where P(w, M) is the probability that the word w and the category M appear together,
      P(w) is the probability that the word w appears, and
      P(M) is the probability that the category M appears.

The k features whose mutual information is largest for each category are selected.

(6) Centroid Dice Coefficient [T99]:
The centroid Dice coefficient is defined as:

    MI_CD(x_i, c) = 2 P(x_i, c) / ( P(x_i) + P(c) )

where P(x_i, c) is the probability that the word x_i and the category c appear together,
      P(x_i) is the probability that the word x_i appears, and
      P(c) is the probability that the category c appears.
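To make these feature selection metrics concrete, the following Python sketch shows how the correlation coefficient of (3) could be computed over a small training collection and used to pick the top k features for one category. It is only an illustration: the function and variable names are ours, and documents are assumed to be represented simply as sets of extracted nouns and noun phrases.

    import math

    def correlation_coefficient(docs, labels, word, category):
        # Counts for the four document/word combinations used in the definition above.
        n_r_pos = sum(1 for d, l in zip(docs, labels) if l == category and word in d)
        n_r_neg = sum(1 for d, l in zip(docs, labels) if l == category and word not in d)
        n_n_pos = sum(1 for d, l in zip(docs, labels) if l != category and word in d)
        n_n_neg = sum(1 for d, l in zip(docs, labels) if l != category and word not in d)
        n = len(docs)
        denom = math.sqrt((n_r_pos + n_r_neg) * (n_n_pos + n_n_neg) *
                          (n_r_pos + n_n_pos) * (n_r_neg + n_n_neg))
        if denom == 0:
            return 0.0
        return math.sqrt(n) * (n_r_pos * n_n_neg - n_r_neg * n_n_pos) / denom

    def select_top_k(docs, labels, category, k):
        # Score every candidate feature and keep the k highest-scoring ones.
        vocabulary = set().union(*docs)
        scores = {w: correlation_coefficient(docs, labels, w, category) for w in vocabulary}
        return sorted(scores, key=scores.get, reverse=True)[:k]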

2.1.2 Representation Step
Given a dictionary (local or universal) created in the previous step, each training
document is then represented in terms of the features in the dictionary. A document is
labelled to indicate the category membership. Each document consists of a value for
each feature in the dictionary, where the value can be either boolean (e.g., indicating
whether or not the feature appears in the document), or numerical (e.g., frequency of
occurrence in the document being processed). To represent the value of the feature
w_j for the ith document, different representation methods can be employed [S89]:
1. Binary:
   a_ij = 1 if the feature w_j is present in the ith document, and a_ij = 0 otherwise.
2. Within-document frequency (TF):
   a_ij = TF_ij = the number of times the feature w_j occurs in the ith document.
3. Inverse document frequency (IDF):
   a_ij = IDF_j = log( N / n_j ) + 1
   where N is the number of documents in the entire collection, and
   n_j is the number of documents in which the feature w_j is present.
4. TF*IDF:
   a_ij = TF_ij × IDF_j.

In the following, we will use a Reuters news document (as shown in Figure 2.2(a)) to
illustrate the four representation methods.

Date: 27 April, 1987
Category: Trade
U.S. WARNS OF TRADE BILL, EC ATTACKS JAPAN
The U.S. warned its major trade partners that its trade deficit must fall by September or a
protectionist trade bill from Congress would be highly likely.
Meanwhile, European Community (EC) external trade chief Willy de Clercq said that if
Japan's trade surplus, which hit almost 90 billion dlrs last year, continued so high, there
would be stormy weather ahead.
U.S. Trade representative Clayton Yeutter told trade leaders from Japan, the EC and
Canada that there was at least a 50-50 chance that a protectionist bill reaching the House
of Representatives this week would pass the Senate in September.
The U.S. economy badly needed better trade figures by then, or President Ronald Reagan
would have a difficult time vetoing such a bill, he said, according to a series of briefings
to reporters by official delegates at the weekend meeting.
A 15 billion dlr U.S. Trade deficit in March had only incensed Congress further, he said.

(a) Sample News Document

Feature Set
F1:  Trade          (IDF value = 3.2)
F2:  Japan          (IDF value = 1.2)
F3:  tariffs        (IDF value = 1.1)
F4:  goods          (IDF value = 1.0)
F5:  U.S.           (IDF value = 1.3)
F6:  exports        (IDF value = 5.1)
F7:  countries      (IDF value = 1.7)
F8:  imports        (IDF value = 4.3)
F9:  agreement      (IDF value = 2.5)
F10: legislation    (IDF value = 2.1)
F11: deficit        (IDF value = 5.3)
F12: markets        (IDF value = 2.6)
F13: nations        (IDF value = 1.5)
F14: Congress       (IDF value = 2.3)
F15: trade surplus  (IDF value = 8.7)

(b) Feature Set for Category Trade
Figure 2.2: Sample News Document and Feature Set
As shown in Figure 2.2(b), assume fifteen features are selected in the local dictionary
(relevant to the trade category). The IDF value for each feature is also shown in
Figure 2.2(b). The document is represented using the four methods described above,
as shown in Table 2.1.

Representation  F1   F2   F3   F4   F5   F6   F7   F8   F9   F10  F11  F12  F13  F14  F15
Binary          1    1    0    0    1    0    0    0    0    0    1    0    0    1    1
Term Freq.      10   3    0    0    5    0    0    0    0    0    2    0    0    2    1
IDF             3.2  1.2  1.1  1.0  1.3  5.1  1.7  4.3  2.5  2.1  5.3  2.6  1.5  2.3  8.7
TF*IDF          32   3.6  0    0    6.5  0    0    0    0    0    10.6 0    0    4.6  8.7

Table 2.1: Representation of News Document Shown in Figure 2.2
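As an illustration of the representation step, the sketch below maps a tokenized document onto a given feature dictionary under the four schemes above. The function name, its arguments, and the simplification that every feature is matched as a single token (so a phrase such as "trade surplus" would need to be pre-joined) are our own assumptions, not part of the thesis.

    import math

    def represent(document_tokens, dictionary, doc_freq, n_docs, scheme="tfidf"):
        # dictionary: ordered list of selected features;
        # doc_freq[f]: number of documents in the collection containing feature f.
        vector = []
        for f in dictionary:
            tf = document_tokens.count(f)                            # within-document frequency
            idf = math.log(n_docs / doc_freq[f]) + 1 if doc_freq.get(f) else 0.0
            if scheme == "binary":
                vector.append(1 if tf > 0 else 0)
            elif scheme == "tf":
                vector.append(tf)
            elif scheme == "idf":
                vector.append(idf)
            else:                                                    # "tfidf"
                vector.append(tf * idf)
        return vector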

2.1.3 Induction Step
The induction step is to automatically discover text categorization patterns that
distinguish categories from one another, based on a training set of manually
categorized documents. In this view, it is the same as the data classification discussed
in the data mining area. Formally speaking, data classification is the process which
finds the common properties among a set of objects in a database and classifies them
into different classes, according to a classification model [BL97]. To construct such a
classification model, a sample database E is treated as the training set, in which each
tuple consists of the same set of attributes (or features) and, additionally, each tuple
has a known class identity (label) associated with it. The objective of the classification
is to first analyze the training data and develop an accurate description or a model for
each class using the features available in the data. Such class descriptions are then
used to classify future test data or to develop a better description (called classification
rules) for each class in the database. Some well-known techniques for data
classification include decision trees, decision rules, neural networks, and k-nearest
neighbor.

Decision Tree induction Techniques:
A decision-tree-based classification method is a supervised learning method that
constructs a decision tree from a set of training examples. Usually, it is a greedy tree
growing algorithm which constructs a decision tree in a top-down recursive
divide-and-conquer strategy [Q86, Q93]. The decision tree starts as a single node
containing the training examples. If the training examples are all of the same class,
then the node becomes a leaf and is labeled with that class. Otherwise, the algorithm
selects a best attribute according to some metric (e.g., information gain) and grows
the decision tree. In the following, we will review the most well-known decision tree
induction technique, ID3, and its descendant, C4.5.

ID3 [Q86] uses an information theoretic approach for attribute selection, aiming at
minimizing the expected number of attribute tests needed to classify a training example. It
guarantees a simple, but not necessarily the simplest, decision tree. Let the data be a
set P of p training examples, and suppose there exist m distinct classes P_i (for i = 1, ..., m).
If P contains p_i examples of class P_i, then an arbitrary example belongs to the class P_i
with probability p_i / p. The expected information needed to classify a given example
(also called the entropy of the set of examples P) is given as:

    I(p_1, p_2, ..., p_m) = − Σ_{i=1}^{m} (p_i / p) log_2 (p_i / p).

An attribute A with values {a_1, a_2, ..., a_k} can be used to partition P into {C_1, C_2, ...,
C_k}, where C_j contains those examples in P that have value a_j of A. Let C_j contain p_ij
examples of class P_i. The expected information based on the partitioning by A (also
called the entropy of the partial tree resulting from A) is given as:

    E(A) = Σ_{j=1}^{k} ( (p_1j + ... + p_mj) / p ) I(p_1j, ..., p_mj).

Thus, the information gained by branching on A is:

    gain(A) = I(p_1, p_2, ..., p_m) − E(A).

Among all available attributes, the attribute resulting in the maximal information gain
is selected as the decision attribute at the node. A branch is created for each value
of the decision attribute, and the training examples are partitioned accordingly. The
algorithm applies the same process recursively to form a decision tree for each partition.
The recursive partitioning stops only when all examples at a given node belong to the
same class, or when there are no remaining attributes on which the examples may be
further partitioned. The algorithm of ID3 is shown in Figure 2.3.



Process_Node (N)
Initialize set of class attribute values at node, N, to empty.
Initialize the set of child nodes of N to empty.
FOR each training instance currently in the sample of training data
available at node N.
Add its class attribute value to set of class attribute value at
node N.
END
IF (number of unused attributes available for N > 0) AND (number
of class attribute values for N > 1)
Group_Values (N)
/* Place values in groups for entropy calculation */
Select_branching_Attribute (N)
/* Using entropy, select branching attribute */
Set Child_Nodes (N)
/* Build set of child nodes for node N */
FOR each child_node of N
Process_Node (child_nodes)
END
END
RETURN

Group_Values (N)
FOR each attribute A_i available for node N
    Initialize set of branching values for attribute A_i to empty
    FOR each data item currently in training sample get set of
        values for attribute A_i at node N
    END
END
RETURN


Figure 2.3: ID3 Algorithm

As mentioned, ID3 uses information gain as the evaluation function for attribute
selection. There exist other evaluation functions, such as the Gini index, the chi-square
test, etc. [BFOS84, K96, P91, WK91]. For example, if a data set T contains examples
from n classes, the Gini index gini(T) is defined as

    gini(T) = 1 − Σ_i p_i²

where p_i is the relative frequency of class i in T.
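For illustration, a minimal Python sketch of the entropy, information gain, and Gini computations defined above is given below; the function names and the representation of examples as dictionaries of attribute values are assumptions made for the example.

    import math
    from collections import Counter

    def entropy(labels):
        # I(p_1, ..., p_m): expected information needed to classify an example.
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        # gain(A) = I(p_1, ..., p_m) - E(A); examples are dicts of attribute values.
        total = len(examples)
        e_a = 0.0
        for value in {ex[attribute] for ex in examples}:
            subset = [l for ex, l in zip(examples, labels) if ex[attribute] == value]
            e_a += (len(subset) / total) * entropy(subset)
        return entropy(labels) - e_a

    def gini(labels):
        # Gini index: 1 - sum of squared relative class frequencies.
        total = len(labels)
        return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())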

On the other hand, C4.5, a descendant of ID3, incorporates additional features over
the basic ID3 algorithm. Initial definition of ID3 assumes discrete valued attributes.
C4.5 incorporates the capability of handling continuous valued attributes in the
decision tree building process. To handle training examples with missing attribute
values, C4.5 assigns the most common value in the training data set or the most
common value with the same class. To avoid overfitting in the tree induction, C4.5
adopts different pruning techniques [Q93].

Decision Rule Induction Technique:
Rule induction methods attempt to find a compact covering rule set that completely
partitions the examples into their correct classes [MMHL86, CN89]. The covering set
is found by searching heuristically for a single best rule that covers cases for only one
class. Having found a best conjunctive rule for a class C, the rule is added to the rule
set, and the cases satisfying it are removed from further consideration. The process is
repeated until no cases remain to be covered. Once a covering set is found that
separates the classes, the induced set of rules is further refined by either pruning or
statistical techniques. Using training-and-test evaluation methods, the initial covering
rule set is then scaled back to the most statistically accurate subset of rules.

In the following, a set of representative decision rule induction techniques will be
summarized. Some of which were proposed for general data classification problems,
while some others were proposed for the text categorization problem.

CN2 Technique:
The CN2 technique combines the efficiency and ability to handle noisy training data
of ID3 with its top down approach to tree generation, and the if-then rule form and
adaptable search strategy of the AQ family of algorithms [B95, CN89]. CN2 makes
its best effort to classify all new data by completing its rule set with the default rule.
The class associated with the default rule is the most common class in the training
data.

Figure 2.4 shows the algorithm of CN2. During each iteration, a complex, i.e., a conjunction
of selectors, is sought which covers the largest number of instances of the training data
from class C(x_i). When the best complex, BEST_CPX, is found for this iteration, the
rule

    If <complex> then predict C(x_i)

is added to the end of the rule list, and all the instances of training data covered by this
rule are removed. This process terminates when no more complexes can be found.



Procedure CN2 (E)
    Let RULE_LIST be the empty list.
    REPEAT UNTIL BEST_CPX is nil or E is empty:
        Let BEST_CPX be Find_Best_Complex(E).
        IF BEST_CPX is not nil,
        THEN let E' be the examples covered by BEST_CPX.
             Remove from E the examples E' covered by BEST_CPX.
             Let C(x_i) be the most common class of examples in E'.
             Add the rule "IF BEST_CPX THEN predict C(x_i)" to the end of
             RULE_LIST.
    RETURN RULE_LIST.

Procedure Find_Best_Complex (E)
    Let STAR be the set containing the empty complex.
    Let BEST_CPX be nil.
    WHILE STAR is not empty,
        Specialize all complexes in STAR as follows:
            Let NEWSTAR = {x ∧ y | x ∈ STAR, y ∈ SELECTORS}.
            Remove all complexes in NEWSTAR that are either in STAR
            (i.e., the unspecialized ones) or null (e.g., big=y ∧ big=n).
        FOR every complex CPX_i in NEWSTAR:
            IF CPX_i is statistically significant and better than BEST_CPX
            by user-defined criteria when tested on E,
            THEN replace the current value of BEST_CPX by CPX_i.
        REPEAT UNTIL size of NEWSTAR <= user-defined maximum:
            Remove the worst complex from NEWSTAR.
        Let STAR be NEWSTAR.
    RETURN BEST_CPX.

Figure 2.4: CN2 Algorithm [CN89]

As the search proceeds, CN2 maintains a limited set, STAR, of best complexes found
so far. The algorithm performs a beam search analyzing only specializations of the set
STAR. A complex is specialized by adding a new conjunctive term. During the
learning process, CN2 makes two heuristic decisions. First, it evaluates the quality of
the complexes to determine a best complex. It determines the quality of a complex
using the entropy function used in ID3. Secondly, CN2 uses a likelihood ratio statistic
to test the significance of a complex, defined as follows:

    Likelihood_Ratio(cpx) = 2 Σ_{i=1}^{k} f_i log( f_i / e_i )

where f_i is the observed frequency of examples of class i that satisfy the
complex cpx, and e_i is the expected frequency of examples of class i if the
complex cpx had selected the covered examples with a probability identical
to that for examples in the entire training set.

Thus, the likelihood ratio statistic measures the distance between the two distributions
(f_i and e_i), which, under proper assumptions, can be approximated as χ² with k−1
degrees of freedom. This ratio provides a significance measure; that is, the lower the
ratio, the more likely it is that the observed regularity resulted from chance, and the
less significant the complex.
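A small sketch of this significance test is given below, under the assumption that the observed and expected class frequencies for a complex have already been counted; the function name and argument layout are illustrative only.

    import math

    def likelihood_ratio(observed, expected):
        # 2 * sum_i f_i * log(f_i / e_i), summed over classes with non-zero counts.
        return 2.0 * sum(f * math.log(f / expected[c])
                         for c, f in observed.items() if f > 0 and expected.get(c, 0) > 0)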

RIPPER Technique:
The classifier constructed by RIPPER [CS99] is a set of rules. This set of rules can be
interpreted as a disjunction of conjunctions. For instance, as shown in Figure 2.5, a
document d is considered to be in the category Ireland if and only if
(the word ireland appears in d) OR
(the word ira appears in d AND the word killed appears in d) OR

(the word ira appears in d AND the word shot appears in d) OR
(the word ira appears in d AND the word out appears in d)


ireland ← ireland ∈ document.
ireland ← ira ∈ document, killed ∈ document.
ireland ← ira ∈ document, kills ∈ document.
ireland ← ira ∈ document, belfast ∈ document.
ireland ← ira ∈ document, to ∈ document, calls ∈ document.
ireland ← irish ∈ document, abortion ∈ document.
ireland ← ira ∈ document, shot ∈ document.
ireland ← ira ∈ document, out ∈ document.
else not_ireland.

Figure 2.5: Rule Set for Category Ireland Constructed by RIPPER (Adapted from
[CS99])

The algorithm of RIPPER (as shown in Figure 2.6) consists of two main stages. The
first stage is a greedy process which constructs an initial rule set. This stage is based
on an earlier rule- learning algorithm called incremental reduced error pruning (IREP)
[FW94], which in turn is based on earlier work due to Quinlan [Q90], Cohen [C93],
Brunk and Pazzani [BP91], and Pagallo and Haussler [PH90]. The second stage is an
optimization phase which attempts to further improve the compactness and
accuracy of the rule set.

Stage 1: Building an Initial Rule Set.
The first stage of RIPPER (called IREP*) is a set-covering algorithm: it
constructs one rule at a time and removes all examples covered by the new rule as
soon as the rule is constructed. To construct a rule, the uncovered examples are
randomly partitioned into two subsets, a growing set containing two-thirds of the
examples and a pruning set containing the remaining one-third. IREP* will first
grow a rule, and then simplify or prune the rule.

A rule is grown by repeatedly adding conditions to a rule r_0 with an empty
antecedent. This is done in a greedy fashion: at each stage i, a single condition is
added to the rule r_i, producing a longer and more specialized rule r_{i+1}. The
condition added is the one that yields the largest information gain for r_{i+1} relative
to r_i [Q90]. Information gain is defined as

    Gain(r_{i+1}, r_i) = T+_{i+1} × ( log_2( T+_{i+1} / (T+_{i+1} + T−_{i+1}) )
                                     − log_2( T+_i / (T+_i + T−_i) ) )

where T+_i is the number of positive examples in the growing set covered by rule r_i, and
T−_i is the number of negative examples in the growing set covered by rule r_i.
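This gain criterion can be sketched as follows, assuming the positive and negative coverage counts of the rule before and after adding the candidate condition are known; the helper name and the guard for empty coverage are our additions.

    import math

    def rule_information_gain(pos_after, neg_after, pos_before, neg_before):
        # Gain(r_{i+1}, r_i) for one candidate condition, following the formula above.
        if pos_after == 0 or pos_before == 0:
            return 0.0
        return pos_after * (math.log2(pos_after / (pos_after + neg_after)) -
                            math.log2(pos_before / (pos_before + neg_before)))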

Stage 2: Optimizing a Rule Set.
After RIPPER stops adding rules, the rule set is optimized so as to further
reduce its size and improve its accuracy. Rules are considered in turn in the order
in which they were added. For each rule r, two alternative rules are constructed.
The replacement for r is formed by growing and then pruning a rule r', where
pruning is guided so as to minimize the error of the entire rule set (with r' replacing r)
on the pruning data. The revision of r is formed analogously, except that it is
grown by greedily adding literals to r, instead of to the empty rule. Finally, a
decision is made as to whether the final theory should include the revised rule, the
replacement rule, or the original rule. This optimization step can be repeated,
occasionally resulting in further improvements in a rule set. Experiments with a
large collection of propositional learning benchmark problems indicate that two
rounds of optimization are usually sufficient, and so in RIPPER this optimization
step is by default repeated twice.


function IREP*(Data)
begin
    Data0 := copy(Data)
    RuleSet := an empty rule set
    while positive examples remain in Data0 do
        /* grow and prune a new rule */
        split Data0 into GrowData, PruneData
        Rule := GrowRule(GrowData)
        Rule := PruneRule(Rule, PruneData)
        add Rule to RuleSet
        remove examples covered by Rule from Data0
        /* check stopping condition */
        if DL(RuleSet) > DL(RuleSet_opt) + d,
           where RuleSet_opt has the lowest DL of any RuleSet constructed so far
        then
            RuleSet := Compress(RuleSet, Data0)
            return RuleSet
        endif
    endwhile
    RuleSet := Compress(RuleSet, Data0)
    return RuleSet
end

function Optimize(RuleSet, Data)
begin
    for each rule c in RuleSet do
        split Data into GrowData, PruneData
        c' := GrowRule(GrowData)
        c' := PruneRule(c', PruneData), guided by error of RuleSet − c + c'
        c'' := GrowRuleFrom(c, GrowData)
        c'' := PruneRule(c'', PruneData), guided by error of RuleSet − c + c''
        replace c in RuleSet with the best of c, c', c''
    endfor
    return RuleSet
end

function RIPPER(Data)
begin
    RuleSet := IREP*(Data)
    repeat 2 times:
        RuleSet := Optimize(RuleSet, Data)
        UncovData := examples in Data not covered by rules in RuleSet
        RuleSet := RuleSet + IREP*(UncovData)
    endrepeat
    return RuleSet
end


Figure 2.6: The RIPPER Algorithm


Swap-1 Technique:
Swap-1 was proposed by Apte, Damerau and Weiss [ADW94] for automatic learning
of decision rules for text categorization. Given a set of sample cases S, where each
case is composed of observed features and the correct classification, the problem is to
find the best rule set, RS_best, such that the error rate on new cases, Err_true(RS_best), is
minimum. Swap-1 derives solutions posed in disjunctive normal form (DNF), where
each class is classified by a set of disjunctive production rules. Each term is a
conjunction of tests p_i, where p_i is a proposition formed by evaluating the truth of
a binary-valued feature or by comparing a threshold to any of the values a numerical
feature assumes in the samples. Unlike the decision tree, where all the implicit
productions are mutually exclusive, a general DNF model does not require mutual
exclusivity of rules. With productions that are not mutually exclusive, rules for two
classes can potentially be satisfied simultaneously. Such conflicts can be resolved by
inducing rules for each class according to a class priority ordering, with the last class
considered a default class.

Swap-1 looks back constantly to see whether any improvement can be made before
adding a new test. The following steps are taken to form the single best rule: (a) make
the single best swap from among all possible rule component swaps, including
deleting a component; (b) if no swap is found, add the single best component to the
rule. As in Weiss et al. [WGT90], the best rule is evaluated by its predictive value (i.e.,
the percentage of correct decisions made by the rule). For equal predictive values, maximum
case coverage is the secondary criterion. Swapping and component addition terminate
when 100% predictive value is reached [CFS94]. The detailed Swap-1 algorithm is
shown in Figure 2.7.



Input: S, a set of training cases
Initialize R_1 := empty set, k := 1, and C_1 := S

repeat
    create a rule B with a randomly chosen attribute as its left-hand side
    while (B is not 100-percent predictive) do
        make the single best swap for any component of B,
        including deleting the component, using cases in C_k
        if no swap is found, add the single best component to B
    endwhile
    P_k := rule B that is now 100-percent predictive
    E_k := cases in C_k that satisfy the single best rule P_k
    R_{k+1} := R_k ∪ {P_k}
    C_{k+1} := C_k − E_k
    k := k + 1
until (C_k is empty)
find a rule r in R_k that can be deleted without affecting performance on
cases in S
while (r can be found)
    R_{k+1} := R_k − {r}
    k := k + 1
endwhile
output R_k and halt

Figure 2.7: The Swap-1 Algorithm

k-Nearest Neighbor Technique:
A k-Nearest Neighbor (k-NN) algorithm is one type of memory-based reasoning. The
memory-based reasoning (MBR) is a directed data mining technique that exploits
decision making based on past experience. By maintaining a database of known
records, MBR finds neighbors similar to a new record and uses the neighbors for
classification and prediction [BL97]. Two major operations are involved in
MBR: the distance (or similarity) function, which assigns a distance (or similarity)
to any two records, and the combination function, which combines the results from
the neighbors to arrive at an answer.

One recent example of a k-NN algorithm for text categorization is known as the
Expert Network (ExpNet) [Y94, LH98]. In a k-NN algorithm, each training document
D_j (with a known category) as well as the document X to be categorized are represented
by features as described above. To conduct categorization, the similarity
sim(X, D_j) between each D_j and X is calculated. In ExpNet [Y94, LH98], the cosine
similarity is employed as the metric:

    sim(X, D_j) = ( Σ_{i=1}^{m} x_i d_ij ) / sqrt( ( Σ_{i=1}^{m} x_i² ) ( Σ_{i=1}^{m} d_ij² ) )

where D_j is an instance in the training collection and is represented as (d_1j, ..., d_mj),
      X is the document to be categorized and is represented as (x_1, ..., x_m), and
      m is the total number of features.

On the other hand, the Dice coefficient has been proposed to calculate the similarity of
two documents [T99]:

    Dice(X, D_j) = 2 |X ∩ D_j| / ( |X| + |D_j| )

where |X| is the number of features in document X,
      |D_j| is the number of features in document D_j, and
      |X ∩ D_j| is the number of features that appear in both documents X and D_j.

The training instances are sorted by the similarity metric in descending order.
Subsequently, the k top-ranking instances are selected. The final score of the
document for each category is calculated by considering the similarity metric of these k
selected instances and their category associations. The basic combination function used
for MBR is to have the k nearest neighbors vote on the answer (the "democracy"
approach) [BL97]. The classification for the target document is simply the majority
vote of the classifications of the neighbors. To avoid ties, k should be odd when
there are only two categories. In general, a good rule of thumb is to use c+1 neighbors
when there are c categories.

Weighted voting is similar to majority voting except that the neighbors are not all
created equal; it is more like a shareholder democracy than one person, one vote
[BL97]. The size of the vote is inversely proportional to the distance from the new
record, so closer neighbors have stronger votes than neighbors farther away do. To
prevent problems when the distance might be 0, it is common to add 1 to the distance
before taking the inverse. Adding 1 also makes all the votes between 0 and 1.
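To illustrate how these pieces fit together, the following sketch categorizes a document with a k-NN search that uses cosine similarity and the distance-weighted vote described above. The function names, the choice of k, and the vector representation are illustrative assumptions rather than part of the reviewed systems.

    import math
    from collections import defaultdict

    def cosine(x, d):
        # Cosine similarity between two equal-length feature vectors.
        dot = sum(a * b for a, b in zip(x, d))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in d))
        return dot / norm if norm else 0.0

    def knn_categorize(x, training, k=5):
        # training: list of (vector, category) pairs with known categories.
        ranked = sorted(training, key=lambda t: cosine(x, t[0]), reverse=True)[:k]
        votes = defaultdict(float)
        for vector, category in ranked:
            distance = 1.0 - cosine(x, vector)           # turn similarity into a distance
            votes[category] += 1.0 / (1.0 + distance)    # closer neighbors get stronger votes
        return max(votes, key=votes.get)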

The k-NN algorithms directly make use of the training examples as instances for
computing the similarity. One of the shortcomings is their sensitivity to noisy
examples. Categorization error occurs if the training instances in the neighborhood
region of the document to be categorized are influenced by noisy examples. In text
categorization problems, it is quite common that instances in the training collection
contain some amount of noise due to such reasons as missing appropriate features,
typographical errors in texts, or wrong categories assigned by humans. The presence of
such noisy examples will affect categorization performance. Besides, k-NN
algorithms do not cope with irrelevant features effectively [LH98].



2.2 Discovery-based Category Management Approach
Agrawal, Bayardo, and Srikant [ABS99] proposed a system, called Athena, for
creating, exploiting, and maintaining a hierarchical arrangement of textual documents
through interactive mining-based operations. The main functions of Athena include:
- Topic Discovery: decomposing an unorganized collection of documents into
  groups so that each group consists of documents on the same topic.
- Hierarchy Reorganization: learning text categorization patterns based on
  pre-defined categories and training documents for each category.
- Hierarchy Population: classifying a new document into one pre-defined
  category based on the text categorization patterns induced during hierarchy
  reorganization.
- Hierarchy Maintenance: identifying manually classified documents that are
  misfiled and suggesting a category to which each misfiled document should belong.

Among the above-described functions of Athena, only topic discovery is related to
the maintenance of the category hierarchy. It is based on a clustering analysis technique,
called C-Evolve, to discover categories from an unorganized collection of documents,
where each of the discovered categories consists of documents on the same topic. That is,
Athena discovers categories from documents rather than evolving existing categories
into new ones.

CHAPTER 3
Mining-based Category Evolution (MiCE) Technique

In this chapter, the development of the mining-based category evolution technique for
adaptively adjusting document categories will be described. The high-level architecture of
text categorization with category evolution will first be described. Subsequently,
the algorithm of the proposed technique will be detailed and illustrated.

3.1 High-level Architecture of Text Categorization with
Category Evolution
The relationship between the Mining-based Category Evolution (MiCE) and the text
categorization is shown in Figure 3.1. The lower portion of Figure 3.1 refers to the
traditional text categorization process, which has been described in Chapter 1. An
additional component, called Mining-based Category Evolution (MiCE), is added for
adaptively managing the document categories. MiCE takes as inputs the existing
document categories, existing classification of documents, and documents themselves.
Based on the existing document categories, new categories are evolved. For example,
an existing category may be decomposed into several categories, each of which is
narrower in the topics covered, or several similar categories may be merged into a new
category which covers the topics contained in the original categories. The newly
evolved categories are generated and serve as the hierarchy for future text
categorization. In addition, re-classification of existing documents may be needed for
those original categories that have been refined. Subsequently, MiCE triggers the Text
Categorization Pattern Learner for re-learning text categorization patterns for the new
categories. That is, the text categorization knowledge is updated to reflect the changes
of categories and document classification.











Figure 3.1: High-level Architecture of Text Categorization with Category Evolution


3.2 Algorithm of Mining-based Category Evolution (MiCE)
Technique
The main objective of the Mining-based Category Evolution (MiCE) technique is to
adaptively adjust the document categories. As mentioned, two main category
evolution operations in MiCE are category decomposition and category merging. The
purpose of the category decomposition is to segment a category into several
categories each of which covers fewer topics than its original category does. A
category may be considered for decomposition if the topics covered by a subset of
documents that are similar to each other differ from (or are disjoint from) those covered
by the remaining documents in the category. On the other hand, the goal of the
category merging is to merge two or more categories into one if they cover similar
topics to each other. The major steps in the MiCE technique are shown in Figure 3.2
and depicted in the following subsections.





















Figure 3.2: Process of MiCE Technique


3.2.1 Category Decomposition
As shown in Figure 3.2, the category decomposition consists of the noun phrase
extraction, the feature selection and representation, the within category disjointness
test, and category decomposition and document re-classification steps.

Noun Phrase Extraction:
As described in Chapter 2, the noun phrase extraction is to extract nouns and noun
phrases from the documents. In this study, we adopted Brill's parser tool [B92, B94]
(available at http://www.cs.jhu.edu/~brill/) and Voutilainen's [V93] noun phrase
detection algorithm for parsing free-text documents as well as extracting nouns and
noun phrases from these documents.

Feature Selection and Representation:
Feature selection and representation is to reduce the number of features, and represent
documents with the selected features, as described in Chapter 2. In this study, we
employed correlation coefficient and TF*IDF methods for feature selection, while the
binary representation scheme was adopted as the document representation method.

Within Category Disjointness Test:
In this study, we define the topics covered by one set of documents to be disjoint
from those covered by another set of documents if the features relevant to these two
sets of documents are different. To test whether a category contains disjoint sets of
documents, we have to cluster all documents in the target category into groups such that
members of the same group are similar to each other and different from those in other groups.
We adopted the correlation coefficient to measure the similarity between any two
documents in the same category. Since the range of a correlation coefficient is [−1, 1],
it is transformed into the typical range [0, 1] of the similarity measure. Assume k
features are selected to represent each document in the target category. The similarity
between two documents, d_1 and d_2, is:

    Similarity(d_1, d_2) = ( 1 + cov(d_1, d_2) / sqrt( V(d_1) V(d_2) ) ) / 2

    Distance(d_1, d_2) = 1 − Similarity(d_1, d_2)

where V(d_i) is the variance of the features in d_i, and
      cov(d_i, d_j) is the covariance of the features between d_i and d_j.
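A minimal sketch of this document similarity is given below, assuming each document is a binary feature vector of length k and that the correlation is the usual Pearson correlation of the two vectors; the function names and the neutral value returned when a variance is zero are our own choices.

    import math

    def correlation_similarity(d1, d2):
        # Similarity(d1, d2) = (1 + corr(d1, d2)) / 2, mapping [-1, 1] onto [0, 1].
        k = len(d1)
        mean1, mean2 = sum(d1) / k, sum(d2) / k
        cov = sum((a - mean1) * (b - mean2) for a, b in zip(d1, d2)) / k
        var1 = sum((a - mean1) ** 2 for a in d1) / k
        var2 = sum((b - mean2) ** 2 for b in d2) / k
        if var1 == 0 or var2 == 0:
            return 0.5                    # undefined correlation treated as neutral
        return (1.0 + cov / math.sqrt(var1 * var2)) / 2.0

    def distance(d1, d2):
        return 1.0 - correlation_similarity(d1, d2)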

Based on the similarity function defined above, a partition-based clustering approach
is applied (e.g., K-means [A73, BL97, JD88, S80] or PAM (Partitioning Around Medoids)
[KR90, NH94]) to segment the documents in the target category into two groups.
Subsequently, the degree of disjointness between these two groups (i.e., the degree of
within category disjointness) can be derived. The degree of within category
disjointness is defined as:
    disjointness(c, α%) = 1 − 2 |F_c1 ∩ F_c2| / ( |F_c1| + |F_c2| )

where c is the target category whose documents are clustered into the groups c_1 and c_2,
      α% is the feature insignificance threshold, used to eliminate the features with
      low frequency (i.e., appearing in less than α% of the documents) in the category c, and
      F_c1 (or F_c2) is the set of features each of which appears in no less than α%
      of the documents in the category c and appears in at least one document in the
      c_1 (or c_2) group.

Example:
Assume that a category currently contains 20 documents (d_1, d_2, ..., d_20), and that 10
features (f_1, f_2, ..., f_10) are selected to represent these documents. Furthermore, assume
the category is clustered into the c_1 and c_2 groups, where c_1 contains 9 documents (d_1,
d_2, ..., d_9) and c_2 contains 11 documents (d_10, d_11, ..., d_20), as shown in Table 3.1.

Group  Document  f1  f2  f3  f4  f5  f6  f7  f8  f9  f10
c1     d1        y   y   y   -   y   y   -   -   -   -
c1     d2        y   -   -   -   y   y   -   -   -   -
c1     d3        -   -   y   -   y   -   -   -   -   -
c1     d4        y   -   y   y   -   -   -   -   -   -
c1     d5        y   -   -   y   -   y   -   -   -   -
c1     d6        -   -   -   y   y   y   -   -   -   -
c1     d7        -   -   y   y   y   -   -   -   -   -
c1     d8        -   -   y   -   y   y   -   -   -   -
c1     d9        y   y   -   -   -   -   -   -   -   -
c2     d10       -   -   -   -   y   -   y   -   y   -
c2     d11       -   -   y   -   y   -   -   y   -   -
c2     d12       -   -   -   -   y   -   -   y   -   -
c2     d13       -   -   -   -   -   y   -   -   -   -
c2     d14       -   -   y   -   y   y   -   y   y   -
c2     d15       -   -   y   -   -   y   -   -   -   -
c2     d16       -   -   -   -   -   -   y   -   y   y
c2     d17       -   -   -   -   -   y   y   -   y   -
c2     d18       -   -   y   -   y   -   y   y   -   -
c2     d19       -   -   y   -   y   -   -   -   -   -
c2     d20       -   -   -   -   -   y   y   y   y   -

Table 3.1: Examples of Documents

In this example, F_c1 = {f_1, f_3, f_4, f_5, f_6}. Although the document d_1 in the c_1 group
contains the feature f_2, f_2 is not included in F_c1 since it appears in less than 10% of the
documents in the category. F_c2 = {f_3, f_5, f_6, f_7, f_8, f_9}; similarly, the feature f_10 is
eliminated from F_c2 due to its low frequency. Thus,

    disjointness(c, 10%) = 1 − 2 |F_c1 ∩ F_c2| / ( |F_c1| + |F_c2| )
                         = 1 − (2 × 3) / (5 + 6)
                         = 0.455

Subsequently, the decomposition decision can be made for each category. Let θ_s
denote the split threshold. For each category c, if its degree of within category
disjointness is greater than the split threshold, then the category c will be decomposed;
otherwise, no decomposition action will be taken on the category c. That is,

    If disjointness(c, α%) > θ_s, then decompose the category c.
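A sketch of the disjointness test and the split decision is given below, assuming each document is given as a set of features and the two groups produced by the clustering step are already available; the function names, the default thresholds, and the zero-division guard are illustrative assumptions.

    def significant_features(docs, alpha):
        # Features appearing in no less than alpha (a fraction) of the category's documents.
        counts = {}
        for d in docs:
            for f in d:
                counts[f] = counts.get(f, 0) + 1
        return {f for f, c in counts.items() if c >= alpha * len(docs)}

    def disjointness(group1, group2, alpha=0.1):
        # disjointness(c, alpha%) for a category clustered into two groups of documents.
        significant = significant_features(group1 + group2, alpha)
        f1 = {f for d in group1 for f in d} & significant
        f2 = {f for d in group2 for f in d} & significant
        if not f1 and not f2:
            return 0.0
        return 1.0 - 2.0 * len(f1 & f2) / (len(f1) + len(f2))

    def should_decompose(group1, group2, split_threshold=0.6, alpha=0.1):
        # Decompose the category when its disjointness exceeds the split threshold.
        return disjointness(group1, group2, alpha) > split_threshold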

Category Decomposition and Document Re-classification:
For those categories whose degree of within category disjointness is greater than the
split threshold (
s
), the category decomposition and document re-classification will
be applied. In this research, we adopted a partition-based clustering technique (i.e.,
PAM) to decompose each target category into multiple categories. The similarity
between any two documents is measured based on the correlation coefficient, as
described above. The optimal number of decomposed categories is determined by the
silhouette coefficient [KR90]. Assume the cluster to which object i is assigned be A.
Let
a(i) = average dissimilarity of i to all other objects of cluster A
For any cluster C different from A, let
d(i,C) = average dissimilarity of i to all objects of C
After computing d(i,C) for all clusters C ≠ A, the smallest among them is selected
and denoted as:

    b(i) = min_{C ≠ A} d(i, C)

The cluster B for which this minimum is attained (i.e., d(i, B) = b(i)) is called the
neighbor of object i. In fact, the cluster B can be viewed as the second-best choice for
object i. Note that the construction of b(i) depends on the availability of clusters
differing from A, which explains why silhouettes are not defined for k = 1, where k is

The silhouette of object i, s(i), is then obtained by combining a(i) and b(i) as follows:

s(i) = 1 − a(i)/b(i)   if a(i) < b(i)
     = 0               if a(i) = b(i)
     = b(i)/a(i) − 1   if a(i) > b(i)

When cluster A contains only a single object, it is unclear how a(i) should be defined, and s(i) is then simply set to 0. For each object i, the following condition obviously holds:

−1 ≤ s(i) ≤ 1
The meaning of s(i) can now be illustrated by the following examples. When s(i) is at its largest (i.e., close to 1), this implies that the within dissimilarity a(i) is much smaller than the smallest between dissimilarity b(i). Therefore, object i can be regarded as well classified since the second-best choice (i.e., cluster B) is not nearly as close as the actual choice (i.e., cluster A). A different situation occurs when s(i) is about zero. In this case, a(i) and b(i) are approximately equal and hence it is not clear at all whether object i should have been assigned to cluster A or B; object i lies equally far away from both clusters, so it can be considered an intermediate case.

As described, s(i) measures how well object i matches the cluster to which it is currently assigned. The average of s(i) over all objects i in the same cluster is called the average silhouette width of that cluster. On the other hand, the average of s(i) over all i = 1, 2, ..., n (where n is the number of objects in the data set) is called the average silhouette width for the entire data set and is denoted by s̄(k); it can be used to select the best value of k, namely the one for which s̄(k) is as high as possible. The silhouette coefficient is SC = max over k of s̄(k), where the maximum is taken over all k for which the silhouettes can be constructed (i.e., k = 2, 3, ..., n−1).

In other words, each target category is partitioned into k clusters, for k = 2, ..., n/2. The k that yields the silhouette coefficient (i.e., the maximal average silhouette width) is selected as the optimal number of clusters for the category. Accordingly, each document in the target category is assigned to the closest cluster (i.e., new category).
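
A minimal sketch of this silhouette-based selection of k is given below. It assumes that dissim is an n×n dissimilarity matrix (e.g., one minus the correlation coefficient) and that cluster_fn is any partitioning routine, such as PAM, that returns one cluster label per document; it is not the thesis implementation.

def silhouette(dissim, labels, i):
    # s(i) as defined above; dissim[i][j] is the dissimilarity of objects i and j.
    own = labels[i]
    same = [j for j, l in enumerate(labels) if l == own and j != i]
    if not same:
        return 0.0                       # singleton cluster: s(i) is set to 0
    a = sum(dissim[i][j] for j in same) / len(same)
    b = float("inf")                     # b(i) = min over clusters C != A of d(i, C)
    for c in set(labels) - {own}:
        members = [j for j, l in enumerate(labels) if l == c]
        b = min(b, sum(dissim[i][j] for j in members) / len(members))
    return (b - a) / max(a, b)           # equivalent to the three-case definition

def best_k(dissim, cluster_fn):
    # Try k = 2 .. n/2 and keep the k with the highest average silhouette width.
    n = len(dissim)
    best, best_score = None, float("-inf")
    for k in range(2, max(n // 2, 2) + 1):
        labels = cluster_fn(dissim, k)
        score = sum(silhouette(dissim, labels, i) for i in range(n)) / n
        if score > best_score:
            best, best_score = k, score
    return best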

3.2.2 Category Merging
The category merging phase consists of three main steps, as shown in Figure 3.2: the
feature selection and representation, the between category overlapping test, and the
category merging.

Feature Selection and Representation:
In this step, feature selection is executed again because the documents were re-classified in the previous phase. As in the category decomposition phase, we adopted the correlation coefficient and TF*IDF methods for feature selection and the binary scheme for representation.
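
As an illustration only, a TF*IDF-style feature scoring for a category might be sketched as follows; the tokenized-document representation and the assumption that the category's documents form part of the whole collection are ours, not part of the thesis implementation.

import math
from collections import Counter

def tfidf_feature_set(category_docs, all_docs, k):
    # category_docs / all_docs are lists of token lists; the category's documents
    # are assumed to be part of all_docs. Returns the top-k features for the category.
    n_docs = len(all_docs)
    df = Counter()
    for doc in all_docs:
        df.update(set(doc))                     # document frequency over the collection
    tf = Counter()
    for doc in category_docs:
        tf.update(doc)                          # term frequency within the category
    score = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return {t for t, _ in sorted(score.items(), key=lambda x: -x[1])[:k]}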

Between Category Overlapping:
We calculated the overlapping (or similarity) at the category level. The degree of overlapping of two categories was defined as:

Overlapping(ci, cj) = (2 × |Fci ∩ Fcj|) / (|Fci| + |Fcj|)

where Fci (or Fcj) is the feature set of the category ci (or cj).

Example:
Assume that the feature sets of categories c1 and c2 are Fc1 = {f1, f2, f3, f4, f5, f6} and Fc2 = {f4, f6, f7, f8, f9, f10}, respectively. In this example, the features f4 and f6 appear in both categories. Thus, the degree of overlapping between c1 and c2 is:

Overlapping(c1, c2) = (2 × |Fc1 ∩ Fc2|) / (|Fc1| + |Fc2|) = (2 × 2) / (6 + 6) = 0.33

Subsequently, the merging decision can be made. Assume that the merging threshold is δm. For any two categories, if their degree of between category overlapping is greater than δm, the two categories can be merged.
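
A small sketch of this overlapping test and the collection of merge candidates follows; it is illustrative only, and the dictionary-of-feature-sets representation is an assumption.

def overlapping(f_i, f_j):
    # Overlapping(ci, cj) = 2|Fci ∩ Fcj| / (|Fci| + |Fcj|)
    return 2.0 * len(f_i & f_j) / (len(f_i) + len(f_j))

def merge_candidates(category_features, merge_threshold):
    # category_features maps a category id to its selected feature set.
    pairs = []
    ids = sorted(category_features)
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            ci, cj = ids[a], ids[b]
            ov = overlapping(category_features[ci], category_features[cj])
            if ov > merge_threshold:
                pairs.append((ci, cj, ov))
    return pairs

With Fc1 = {f1, ..., f6} and Fc2 = {f4, f6, f7, ..., f10} as in the example above, overlapping returns 0.33, so the pair is kept as a merge candidate only when the merging threshold is below 0.33.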

Category Merging
Because the overlapping measure is defined for two categories, conflicting merging decisions may arise. Assume that category a and category b can be merged (i.e., their degree of overlapping is greater than δm), and that category b and category c can be merged too. At the same time, assume overlapping(a, c) < δm, so category a and category c should not be merged. In this situation, conflicting merging decisions arise. The conflict is resolved as follows. We use a graph representation for the merging decisions. A category is represented as a node, while a merging decision (ci, cj) is represented as a labeled link between ci and cj; the degree of overlapping between ci and cj is associated with the link. Accordingly, all merging decisions derived in the previous step are represented in the graph. For each connected subgraph of size (i.e., number of nodes) greater than 1, if the subgraph is not a well-connected graph, the link with the lowest overlapping measure is removed. This process is repeated until the target subgraph is well-connected.
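
This conflict-resolution step can be sketched as follows (a hedged sketch, not the thesis code): merging decisions are kept as weighted edges, and in every connected subgraph that is not fully linked the lowest-valued link is dropped until only well-connected subgraphs remain.

def connected_components(nodes, edges):
    # Union-find over the merge graph; edges are (ci, cj, overlap) tuples.
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b, _ in edges:
        parent[find(a)] = find(b)
    comps = {}
    for n in nodes:
        comps.setdefault(find(n), set()).add(n)
    return list(comps.values())

def resolve_conflicts(edges):
    edges = list(edges)
    nodes = {n for a, b, _ in edges for n in (a, b)}
    while True:
        changed = False
        for comp in connected_components(nodes, edges):
            inside = [e for e in edges if e[0] in comp and e[1] in comp]
            fully_linked = len(comp) * (len(comp) - 1) // 2
            if len(comp) > 1 and len(inside) < fully_linked:   # not well-connected
                edges.remove(min(inside, key=lambda e: e[2]))  # drop the weakest link
                changed = True
        if not changed:
            # every remaining connected subgraph is well-connected;
            # each subgraph with more than one node is one merge decision
            return [c for c in connected_components(nodes, edges) if len(c) > 1]

Applied to the example of Figure 3.3 below, the link between c1 and c2 (the lowest-valued link of its subgraph) would be removed, leaving {c2, c3} and {c4, c5, c6} as the merge decisions.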

Example:
Assume that the merging threshold is 0.4. The graph constructed from all merging decisions is shown in Figure 3.3.

[Figure 3.3: Graph for Merging Decisions — seven category nodes c1 through c7; the labeled links carry overlapping values 0.41, 0.66, 0.55, 0.6, and 0.45, and c7 is an isolated node.]

As shown in Figure 3.3, two subgraphs of size greater than 1 exist. The subgraph {c1, c2, c3} is not a well-connected graph. Thus, the link with the lowest overlapping value (i.e., the link between c1 and c2) is removed. As a result, the subgraph becomes a well-connected one, denoting that the merging of c2 and c3 should be performed. On the other hand, the second subgraph {c4, c5, c6} is a well-connected graph. Thus, the merging of c4, c5 and c6 should proceed.

3.3 Complete MiCE Algorithm
The MiCE algorithm described in the previous subsections is shown in Figure 3.4.

INPUT: Categories, pre-classified documents, δs, δm, ε%

/* Category Decomposition */
Extract nouns and noun phrases from the pre-classified documents;
Select features for all categories and create the feature dictionary;
Represent documents with the feature dictionary;
For each category c
    Calculate disjointness(c, ε%);
    If disjointness(c, ε%) > δs
    Then
        Partition (by applying K-means or PAM) the documents of the
        category c into k clusters, where k results in the maximal
        silhouette coefficient;
        Classify each document in the category c into the closest cluster;
        Update the categories by removing c and adding the k new categories;
    End-if;
End-for;

/* Category Merging */
Select features for all categories and create the feature dictionary;
Merging = ∅;
For any two categories ci and cj
    Calculate overlapping(ci, cj);
    If overlapping(ci, cj) > δm
    Then
        Merging = Merging ∪ {(ci, cj)};
    End-if;
End-for;

Decision = ConflictResolution(Merging);
For each d in Decision
    Merge all categories contained in d;
    Update the categories by removing all categories contained in d and
    adding the newly merged category;
End-for;

Figure 3.4: MiCE Algorithm
CHAPTER 4
Evaluations of Mining-based Category Evolution Technique

In this chapter, the proposed MiCE technique for category evolution is empirically evaluated. The empirical evaluation covers the effectiveness of category evolution, the sensitivity to the quality of the original categories, and the effect of category evolution on categorization accuracy.

4.1 Test Data Set
We used the single-category version of Distribution 1.0 of the Reuters-21578 collection¹ as the source of the test data set. This collection originally contains 64 categories and 9034 single-category documents. We selected those categories whose number of documents is equal to or greater than 100. Furthermore, to make the sizes of the categories more balanced, we selected only those documents whose number of lines is between 10 and 30 in the acq and earn categories (originally, the acq category contains 2237 documents and the earn category contains 3801 documents). Consequently, 10 categories resulted and were adopted for our evaluation purpose. A summary of the 10 categories selected from the Reuters-21578 collection is provided in Table 4.1. There are 2697 news documents in total, and the average number of words per document is 192. A sample news document is shown in Figure 4.1.




¹ This collection is publicly available at: http://www.research.att.com/~lewis/reuters21578.html.
News Category    Number of Documents    Average Number of Words per Document
acq              497                    223
coffee           100                    195
crude            347                    217
earn             381                    215
interest         165                    172
money-fx         250                    179
money-supply     155                    158
ship             148                    145
sugar            125                    169
trade            323                    250
Total            2697                   192
Table 4.1: Summary of Test Data Set

Category: Interest
SUBJECT: BANK OF JAPAN TO SELL 1,200 BILLION YEN IN BILL
DATE: June 18, 1987


The Bank of Japan will tomorrow sell 1,200 billion yen in bills from its
holdings to help absorb a projected money market surplus of 2,100 billion,
money market traders said.
Of the total, 800 billion yen will yield 3.6004 pct on sales from money
houses to banks and securities houses in 34-day repurchase agreements
maturing on August 3.
The other 200 billion yen will yield 3.6003 pct in 43-day repurchase
accords maturing on August 12.
The remaining 200 billion yen will yield 3.6503 pct in 50-day repurchase
agreements maturing on August 19.
The repurchase agreement yields compare with the 3.5625 pct one-month
commercial bill discount rate today and 3.6250 pct on two-month bills.
They attributed the projected surplus mainly to 1,900 billion yen of
government tax allocations to local governments and public bodies.

Figure 4.1: Sample News Document


4.2 Effectiveness of Category Evolution
In this experiment, we examined the effectiveness of category evolution based on the MiCE technique. The discovery-based category management approach for category discovery, as suggested in [ABS99], was used as the benchmark. Specifically, we implemented the PAM clustering algorithm for category discovery.

4.2.1 Evaluation Procedure
We assume that the categories provided by the Reuters-21578 collection are the true categories for the data set. We randomly selected 6 categories from the 10 categories in the test data set. To evaluate the effectiveness of category evolution, we randomly selected 3 of these categories, each of which was split into 2 new categories; the news documents in each selected category were evenly and randomly assigned to its new categories. As a result, 9 categories were obtained. Subsequently, 2 categories were randomly selected and merged into a new category, and 3 such merges were performed. Finally, the 6 original categories were thus corrupted into 6 new categories.

Using the 6 corrupted categories, the MiCE technique was applied for category evolution. The evolved categories were then compared with the true categories based on several evaluation criteria (discussed in the next subsection). For the selection-corruption-evolution-evaluation process, five trials were performed to avoid possible bias in the random selection and corruption process.
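
The corruption procedure can be sketched as follows; this is an assumed helper for illustration rather than the thesis code, and the category and document identifiers are arbitrary.

import random

def corrupt(categories, n_splits, n_merges, seed=None):
    # categories maps a category name to a list of document ids.
    rng = random.Random(seed)
    cats = {name: list(docs) for name, docs in categories.items()}
    for name in rng.sample(sorted(cats), n_splits):        # split phase
        docs = cats.pop(name)
        rng.shuffle(docs)
        half = len(docs) // 2
        cats[name + "_1"], cats[name + "_2"] = docs[:half], docs[half:]
    for _ in range(n_merges):                               # merge phase
        a, b = rng.sample(sorted(cats), 2)
        cats[a + "+" + b] = cats.pop(a) + cats.pop(b)
    return cats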

4.2.2 Evaluation Criteria
We defined three evaluation criteria to evaluate the effectiveness of category evolution using the MiCE technique: purity, diversity, and specificity. We prefer each evolved category to contain only documents from a single true category. Thus, the purity of an evolved category is defined as the maximum number of documents within the evolved category that belong to the same true category, divided by the total number of documents within the evolved category [ABS99].
Purity(c) = nc / Nc

where nc is the maximal number of documents in the evolved category c that belong to the same true category, and Nc is the total number of documents in the evolved category c.

The overall purity of the evolved categories is the weighted average of the purities of all evolved categories:

Purity = Σc (Nc / N) × Purity(c)

where N = Σc Nc is the total number of documents in all of the evolved categories.

On the other hand, the ideal category evolution is one in which all of the true categories are covered by the evolved categories. We define the true category covered by an evolved category as the true category of the dominant class of documents within that evolved category. The diversity is then defined as the percentage of true categories that are covered by the evolved categories:

Diversity = tc / Tc

where tc is the number of true categories covered by the evolved categories, and Tc is the total number of true categories.

The last evaluation criterion is specificity, which measures the efficiency of the evolved categories in representing the true categories. The specificity is defined as the number of true categories that are covered by the evolved categories, divided by the total number of evolved categories:

Specificity = tc / Te

where tc is the number of true categories covered by the evolved categories, and Te is the total number of evolved categories.

Example:
Assume that the true categories are A, B, C and D, and that each true category contains 250 documents. Let the evolved categories be E1, E2, E3, E4 and E5. The documents in each evolved category and the true category covered by each evolved category are shown in Table 4.2. For example, the evolved category E3 contains 250 documents belonging to the true category B and 150 documents belonging to the true category C. Since the documents of category B are dominant in E3, the true category covered by E3 is B.

True Category    Number of Documents
A                250
B                250
C                250
D                250

Evolved Category    Documents                                        True Category Covered
E1                  100 documents from A                             A
E2                  150 documents from A                             A
E3                  250 documents from B and 150 documents from C    B
E4                  150 documents from D and 100 documents from C    D
E5                  100 documents from D                             D
Table 4.2: Example of True Categories and Evolved Categories

The purity of the evolved categories in this example is:

Purity = (100/1000) × Purity(E1) + (150/1000) × Purity(E2) + (400/1000) × Purity(E3) + (250/1000) × Purity(E4) + (100/1000) × Purity(E5)
       = (100/1000)(100/100) + (150/1000)(150/150) + (400/1000)(250/400) + (250/1000)(150/250) + (100/1000)(100/100)
       = 750/1000 = 75%

The evolved categories cover the true categories A, B and D, as shown in Table 4.2. Thus, the diversity of the evolved categories is:

Diversity = |{A, B, D}| / |{A, B, C, D}| = 3/4 = 75%

On the other hand, the specificity of the evolved categories is:

Specificity = |{A, B, D}| / |{E1, E2, E3, E4, E5}| = 3/5 = 60%
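
The three criteria can be computed directly from the composition of each evolved category; the following sketch (ours, not the thesis code) reproduces the numbers of this example.

from collections import Counter

def evaluate(evolved, true_categories):
    # evolved maps an evolved-category id to a Counter of true-category labels.
    total = sum(sum(c.values()) for c in evolved.values())
    purity = sum(max(c.values()) for c in evolved.values()) / total
    covered = {c.most_common(1)[0][0] for c in evolved.values()}   # dominant classes
    diversity = len(covered) / len(true_categories)
    specificity = len(covered) / len(evolved)
    return purity, diversity, specificity

evolved = {
    "E1": Counter({"A": 100}),
    "E2": Counter({"A": 150}),
    "E3": Counter({"B": 250, "C": 150}),
    "E4": Counter({"D": 150, "C": 100}),
    "E5": Counter({"D": 100}),
}
print(evaluate(evolved, {"A", "B", "C", "D"}))   # (0.75, 0.75, 0.6)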





4.2.3 Result
As mentioned, we employed the correlation coefficient and TF*IDF as the feature selection methods in the MiCE technique, and we report their effect on the effectiveness of category evolution. Several parameters are specified in the MiCE technique, including the number of features for representing documents (k), the feature insignificance threshold (ε%), the split threshold (δs), and the merging threshold (δm). In this experiment, we tested different k ranging from 25 to 100 with an increment of 25, 4 different split thresholds (0.7, 0.6, 0.5, and 0.4), and 3 different merging thresholds (0.4, 0.3, and 0.2). We set the feature insignificance threshold to 5%; that is, any feature that appears in less than 5% of the documents in a category is excluded from the calculation of the within category disjointness measure.

Setting the split threshold to 0.6 and the merging threshold to 0.4, the performance of the MiCE technique for category evolution is shown in Table 4.3.

Correlation Coefficient Feature Selection Method
Number of Features (k)    Purity    Diversity    Specificity
25                        85.7%     96.67%       63.44%
50                        86.6%     100%         65.89%
75                        91.8%     96.67%       47.28%
100                       92.4%     100%         52.28%

TF*IDF Feature Selection Method
Number of Features (k)    Purity    Diversity    Specificity
25                        70%       80%          100%
50                        70%       83.3%        100%
75                        70%       76.67%       89.81%
100                       72.57%    83.3%        84.57%
Table 4.3: Experimental Results (δs = 0.6, δm = 0.4)

As shown in Table 4.3, for both feature selection methods, as the number of features increased from 25 to 100, the purity of the evolved categories improved at the cost of specificity. In this case, the specificity decreased from 63.44% to 52.28% for the correlation coefficient feature selection method, and from 100% to 84.57% for the TF*IDF method. The increase in the number of features had no clear effect on the diversity measure for either feature selection method.

The correlation coefficient feature selection method achieved better purity and diversity measures than its counterpart at every number of features investigated. However, the specificity resulting from the use of the TF*IDF feature selection method was much higher than that of the correlation coefficient method at every number of features examined. When the number of features was 100, 52.28% of the evolved categories corresponded to the true categories for the correlation coefficient method, versus 84.57% for the TF*IDF method. Judging from the balance of the three criteria, the TF*IDF method achieved a more balanced performance than the correlation coefficient method. Thus, in the subsequent evaluations, the TF*IDF method is adopted as the feature selection method. The number of features is set to 100 in the following evaluations since it achieved the best purity and diversity measures, although the specificity was degraded from 100% to 84.57% (for the TF*IDF method).

Next, the effect of different split and merging thresholds on category evolution was investigated. TF*IDF was adopted as the feature selection method, and the number of features was set to 100, as suggested by the previous experiment. As shown in Table 4.4, decreasing the split threshold improved the purity of the evolved categories at every level of the merging threshold. On the other hand, decreasing the merging threshold resulted in a decrease in the purity of the evolved categories.

                        Merging Threshold (δm)
Split Threshold (δs)    0.4       0.3       0.2
0.7                     70.43%    65.43%    60.43%
0.6                     72.57%    69%       67.93%
0.5                     75.87%    72.53%    63.37%
0.4                     79.92%    79.92%    75.75%
Table 4.4: Purity of Evolved Categories (Feature Selection Method = TF*IDF, k = 100)

The effect of the split and merging thresholds on the diversity of the evolved categories (shown in Table 4.5) is similar to that on the purity. On average, the diversity reached a satisfactory level (about 80%).

                        Merging Threshold (δm)
Split Threshold (δs)    0.4       0.3       0.2
0.7                     76.67%    70%       63.33%
0.6                     83.3%     76.67%    73.33%
0.5                     83.33%    80%       70%
0.4                     87.5%     87.5%     83.33%
Table 4.5: Diversity of Evolved Categories (Feature Selection Method = TF*IDF, k = 100)

As shown in Table 4.6, the proposed MiCE technique achieved satisfactory
performance measured by specificity, ranging from 84.57% (when the split threshold
was 0.6 and the merging threshold was 0.4) to 100% (when the split threshold was 0.4
and the merging threshold was 0.2).

                        Merging Threshold (δm)
Split Threshold (δs)    0.4       0.3       0.2
0.7                     91.67%    93.33%    96%
0.6                     84.57%    85.48%    91.67%
0.5                     87.43%    87.43%    90.29%
0.4                     92.26%    95.83%    100%
Table 4.6: Specificity of Evolved Categories (Feature Selection Method = TF*IDF, k = 100)

The proposed MiCE technique for category evolution was benchmarked against the discovery-based category management approach suggested in [ABS99]. We implemented the partition-based clustering technique (i.e., PAM) for category discovery. The six randomly selected categories were merged into a single category, from which the clustering technique was used to find the optimal clustering. Similarly, the silhouette measure was employed to determine the optimal number of clusters. The selection-and-discovery process was repeated five times and the overall performance was estimated by averaging the performance across all iterations. As shown in Table 4.7, the purity was 62.6%, which was lower than that achieved by any split-merging threshold combination of the MiCE technique. On average, the discovery-based category management approach performed slightly better on the diversity measure than the MiCE technique. On the other hand, the MiCE technique outperformed the discovery-based category management approach in specificity. Jointly, the empirical results suggest that our proposed technique can result in a better category hierarchy than the discovery-based category management approach.

Purity Diversity Specificity
62.6% 86.67% 69.67%
Table 4.7: Performance of the Discovery-based Category Management Approach





4.3 Sensitivity to the Quality of Categories
In this experiment, we examined the performance of the MiCE technique for category
evolution when varying the quality of the categories. As mentioned in the previous section, we assume that the categories defined in the Reuters-21578 collection are the true categories. We randomly decomposed and merged the true categories to alter the quality of the categories. By changing the number of decompositions or merges, data sets of different quality were produced: the more decompositions or merges made on the true categories, the worse the quality of the resulting test data set.

The number of decompositions (splits) ranged from 2 to 5, and the number of merges ranged from 2 to 5. For each specific number of splits and merges, we randomly selected 6 categories from the 10 categories in the Reuters-21578 collection, and 100 news documents were randomly selected from each selected category. The 6 categories were then altered by the specified numbers of decomposition and merging operations. We repeated the process five times for each specific number of splits and merges, and the overall performance was estimated by averaging the performance across all iterations. The parameters used in this experiment are as follows: TF*IDF as the feature selection method, the binary scheme as the document representation, a split threshold of 0.4, and a merging threshold of 0.3. The empirical results are shown in Tables 4.8 to 4.10.






                          Number of Splits
Number of Mergings    2         3         4         5         Average
2                     93.79%    91.30%    92.04%    93.27%    92.6%
3                     84.53%    83.96%    88.90%    87.37%    86.19%
4                     85.13%    86.20%    79%       86.03%    84.09%
5                     64.63%    70.42%    80.40%    80.97%    74.10%
Average               82.02%    82.97%    85.09%    86.91%    84.25%
Table 4.8: Sensitivity Experiment: Purity of Evolved Categories

                          Number of Splits
Number of Mergings    2         3         4         5         Average
2                     100%      96.67%    95.83%    100%      98.13%
3                     90%       95.83%    96.67%    100%      95.62%
4                     96.67%    100%      90%       100%      96.67%
5                     83.33%    83.33%    100%      100%      91.67%
Average               92.5%     93.96%    95.63%    100%      95.52%
Table 4.9: Sensitivity Experiment: Diversity of Evolved Categories

                          Number of Splits
Number of Mergings    2         3         4         5         Average
2                     57.68%    57.90%    50.59%    57.68%    55.96%
3                     72.49%    89.29%    71.39%    72.49%    76.42%
4                     86.43%    72.56%    71.74%    86.43%    79.29%
5                     93.33%    100%      78.81%    93.33%    91.37%
Average               77.48%    79.94%    68.13%    77.48%    75.76%
Table 4.10: Sensitivity Experiment: Specificity of Evolved Categories

As shown in Tables 4.8, 4.9 and 4.10, when the number of splits increased, the purity of the evolved categories was only slightly affected at any number of mergings. On the other hand, deteriorating the category quality of the test data set by increasing the number of merges resulted in a decrease in the purity measure (from 92.6% to 74.10%). However, the effect of different numbers of splits or mergings on the diversity was not significant. The increase in the number of mergings improved the specificity of the evolved categories. Under the worst category quality (i.e., the number of splits being 5 and the number of mergings being 5 as well), the proposed MiCE technique could still achieve satisfactory performance, manifested by a purity of 80.97%, a diversity of 100% and a specificity of 84.14%. Across all numbers of splits and mergings examined, the MiCE technique demonstrated satisfactory effectiveness for category evolution (average purity = 84.25%, average diversity = 95.52%, and average specificity = 75.76%).


4.4 Effect of Category Evolution on Categorization Accuracy
Theoretically, a good category evolution should lead to a better category hierarchy and thus a higher categorization accuracy. Therefore, the effect of category evolution using the MiCE technique on categorization accuracy needs to be empirically evaluated.

We selected the categories whose number of documents is more than 150 from our
test data set. As a result, 6 categories were selected: acq, crude, earn, interest,
money-fx, and trade. The total number of documents was 1767.

4.4.1 Evaluation Procedure
The experiment proceeded as follows. The selected categories and their training documents were randomly split and/or merged into new categories. The number of splits or mergings was randomly determined, ranging from 1 to 3. Since the categories were altered, the testing documents needed to be assigned to the new categories. The category assignment for the testing documents is illustrated in Figure 4.2. Assume that the true categories are A, B and C. After the alteration, the altered categories are A1 (which is a part of A), A2B (which is the merging of B and a part of A), and C (which corresponds to the original C). A testing document di of the category A can be assigned to either the A1 category or the A2B category. That is, if the text categorization suggests that di belongs to the category A1 or A2B, it is counted as a correct categorization; otherwise, it is an incorrect one.

True Categories      A     B      C
Altered Categories   A1    A2B    C
(A2B is composed of B and a part of A)

True Category of a Testing Document    Correct Category Assignment Based on Altered Categories
A                                      A1 or A2B
B                                      A2B
C                                      C
Figure 4.2: Examples of Category Assignment for Testing Documents

The altered categories and their associated training documents were input into a text categorization system for learning the text categorization patterns. In this study, we adopted CN2 and C4.5 as the learning algorithms. Subsequently, the testing documents were classified using the induced text categorization patterns, and the accuracy of the classification was measured.

On the other hand, the altered categories were evolved using the MiCE technique and the training documents were re-classified into the newly evolved categories. How the testing documents are associated with the evolved categories needs to be determined. For each evolved category, its corresponding true category (or categories) was determined first, based on the percentage of documents with the same true category in the evolved category. For example, as shown in Figure 4.3, assume that the true categories are A, B and C, and the newly evolved categories are E1, E2, E3 and E4. The corresponding true category for E1 is A, while the corresponding true categories for E2 are A and B. A testing document di of the category A can be classified into E1 or E2. That is, if the text categorization suggests that di belongs to the evolved category E1 or E2, it is regarded as a correct categorization; otherwise, it is an incorrect one.

True Categories = {A, B, C}

Evolved Category    Corresponding True Category
E1                  A
E2                  A and B
E3                  C
E4                  C

True Category of a Testing Document    Correct Category Assignment Based on Evolved Categories
A                                      E1 or E2
B                                      E2
C                                      E3 or E4
Figure 4.3: Examples of Category Assignment (After Category Evolution) for Testing Documents
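
A hedged sketch of this correctness check is given below; the share threshold used to decide which true categories an evolved category corresponds to is our assumption, since the text states only that the decision is based on the percentage of documents from each true category.

from collections import Counter

def corresponding_true(evolved_composition, share=0.3):
    # evolved_composition maps an evolved-category id to a Counter of the true
    # categories of its training documents; `share` is an assumed cut-off.
    mapping = {}
    for eid, counts in evolved_composition.items():
        total = sum(counts.values())
        mapping[eid] = {t for t, n in counts.items() if n / total >= share}
    return mapping

def is_correct(predicted_evolved_id, true_label, mapping):
    return true_label in mapping.get(predicted_evolved_id, set())

For an evolved category such as E2 in Figure 4.3, whose training documents come from both A and B in comparable proportions, the mapping would be {A, B}, so a testing document of A classified into E2 counts as correct.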

Subsequently, the evolved categories and their associated training documents were input into a learning algorithm (i.e., CN2 and C4.5) for inducing the text categorization pattern for each evolved category. Afterwards, the testing documents were classified using the induced text categorization patterns, and the accuracy of the classification was also measured.

The 10-fold cross validation technique [WK91] was adopted. That is, the documents in each category were randomly divided into ten mutually exclusive data sets of equal size. The learning-and-testing or evolution-learning-and-testing process proceeded in an iterative manner. In each iteration, one data set was chosen as the testing data and the others were used for learning or for evolution-and-learning purposes. Thus, the learning-and-testing or evolution-learning-and-testing process was performed 10 times and the overall learning performance was estimated by averaging the performance across all iterations.
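
The fold construction can be sketched as follows (illustrative only); the per-category stratification mirrors the description above.

import random

def ten_fold_splits(categories, seed=0):
    # categories maps a category to its list of document ids;
    # yields (training, testing) dictionaries for each of the ten iterations.
    rng = random.Random(seed)
    folds = {}
    for c, docs in categories.items():
        docs = list(docs)
        rng.shuffle(docs)
        folds[c] = [docs[i::10] for i in range(10)]
    for i in range(10):
        testing = {c: folds[c][i] for c in categories}
        training = {c: [d for j in range(10) if j != i for d in folds[c][j]]
                    for c in categories}
        yield training, testing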

4.4.2 Results
As shown in Table 4.11, before the category evolution, the classification accuracies achieved by CN2 and C4.5 were 62.62% and 69.28%, respectively. After the category evolution, the classification accuracy improved from 62.62% to 67.79% (using CN2) and from 69.28% to 73.74% (using C4.5). Thus, category evolution using our MiCE technique has shown its capability to improve classification accuracy.
                         Before Category Evolution    After Category Evolution
Accuracy Rate of CN2     62.62%                        67.79%
Accuracy Rate of C4.5    69.28%                        73.74%
Table 4.11: Comparison of Classification Accuracy
CHAPTER 5
Conclusions and Future Research Directions

Assigning categories to documents is essential to the efficient management and
retrieval of documents. Most past works related to this area focused on improving
classification accuracy or comparing the performance of different classification
methods. The development of techniques for category evolution has been ignored in
the literature. In this study, we proposed a mining-based category evolution (MiCE)
technique to adjust the categories based on the existing categories and their associated
documents. According to the empirical evaluation results, the proposed technique,
MiCE, was more effective than the discovery-based category management approach,
insensitive to the quality of original categories, and capable of improving
classification accuracy.

Some ongoing and future directions along this line of research include:
1. The text documents used in this study were Reuters news articles. It would be desirable and interesting to evaluate the MiCE technique using documents from other areas (e.g., emails) or using larger data sets.

2. As mentioned, we adopted the correlation coefficient to measure the similarity between documents in the category decomposition phase. The feature selection methods evaluated included the correlation coefficient and TF*IDF, while the representation method was binary. In the future, different document similarity measures (e.g., the cosine function), feature selection methods (e.g., mutual information, DICE), and representation methods (e.g., term frequency, TF*IDF) can be incorporated into the MiCE technique and empirically evaluated.

3. This study assumed that at most one category is assigned to each document. However, in real-world applications, it is often observed that a document may be assigned to more than one category. Thus, extending the proposed MiCE technique to deal with the requirement of multi-category assignment would be necessary.

4. The evolution of categories and the re-classification of documents require the repeated execution of feature selection and may trigger the re-learning of text categorization patterns. In this study, non-incremental feature selection and non-incremental inductive learning techniques were adopted. Incremental feature selection and incremental inductive learning are essential to improving the efficiency of the proposed category evolution.

5. Much knowledge is embedded in textual documents. This study applied data mining techniques to category evolution. Other data mining applications for extracting useful knowledge would be essential to exploiting the value of the documents maintained by organizations.
REFERENCES
[A73] Anderberg, M. R., Cluster Analysis for Applications, Academic Press, Inc.,
1973.
[ABN92] Anwar, T. M., Beck, H. W., and Navathe, S. B., Knowledge Mining by
Imprecise Querying: A Classification-Based Approach, Proceeding of the
Eighth International Conference Data Engineering, Feb. 1992, pp. 622-630.
[ABS99] Agrawal, R., Bayardo, R., and Srikant, R., Athena: Mining-based
Interactive Management of Text Databases, Proceedings of the Seventh
Conference on Extending Database Technology, July, 1999.
[ADW94] Apte, C., Damerau, F., and Weiss, S. M., Automated Learning of Decision
Rules for Text Categorization, ACM Transactions on Information Systems.
Vol. 12, No. 3, July 1994, pp. 233-251.
[AGIIS92] Agrawal, R., Ghosh, S., Imielinski, T., Iyer, B., and Swami, A., An
Interval Classifier for Database Mining Applications, Proceeding of the
18th International Conference on Very Large Data Bases, Aug. 1992, pp.
560-573.
[B92] E. Brill, "A Simple Rule-Based Part of Speech Tagger," In Proceedings of
the Third Conference on Applied Natural Language Processing, ACL,
Trento, Italy, 1992.
[B94] E. Brill, "Some Advances in Rule-Based Part of Speech Tagging,"
Proceedings of the Twelfth National Conference on Artificial Intelligence
(AAAI-94), Seattle, Wa., 1994.
[B95] Boll, E. M., Analysis of Rule Sets Generated by the CN2, ID3, and
Multiple Convergence Symbolic Learning Methods, Proceedings of the
1995 ACM 23rd Annual Conference on Computer Science Conference, 1995,
pp. 48-55.
[BFOS84] Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification of
Regression Trees. Wadsworth, 1984.
[BL97] Berry, M. J. A., and Linoff, G., Data Mining Techniques: For Marketing,
Sales and Customer Support, Wiley, 1997.
[BP91] Brunk, C., and Pazzani, M., Noise-tolerant Relational Concept Learning
Algorithms, Proceedings of the 8th International Workshop on Machine
Learning, Ithaca, NY, 1991.
[C93] Cohen, W. W., Efficient Pruning Methods for Separate-and-Conquer Rule
Learning Systems, Proceedings of the 13th International Joint Conference
on Artificial Intelligence, Chambery, France, 1993.
[CFS94] Apte, C., Damerau, F., and Weiss, S. M., Automated Learning
of Decision Rules for Text Categorization, ACM Transactions on
Information Systems, July 1994, Vol.12, No.3, pp.233-251.
[CH89] Church, K. W. and Hanks, P., Word Association Norms, Mutual
Information, and Lexicography, Proceedings of the 27th Annual Meeting of
the Association for Computational Linguistics, 1989, pp.76-83.
[CHY96] Chen, M. S., Han, J., and Yu, P. S., Data Mining: An Overview from A
Database Perspective, IEEE Transactions on Knowledge and Data
Engineering, Vol. 8, No. 6, December 1996, pp.866-883.
[CN89] Clark, P., and Niblett, T., The CN2 Induction Algorithm, Machine
Learning, Vol. 3, 1989, pp.261-283.
[CS96a] Cheeseman, P., and Stutz, J., Bayesian Classification (AutoClass): Theory
and Results, Advances in Knowledge Discovery and Data Mining, U. M.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.),
AAAI/MIT Press, 1996, pp. 153-180.
[CS96b] Cohen, W. W., and Singer, Y., Learning to Query the Web, Proceedings of
the 13th National Conference on Artificial Intelligence, Portland, OR, 1996.
[CS99] Cohen, W. W., and Singer, Y., Context-sensitive Learning Methods For Test
Categorization, ACM Transactions on Information Systems, Vol. 17, No. 2,
April 1999, pp.141-173.
[D76] Dudani, S. A., The Distance-Weighted k-Nearest-Neighbor Rule, IEEE
Transactions on Systems, Man and Cybernetics, Vol. SMC-6, No. 4, April
1976, pp. 325-327.
[DH73] Duda, R. O. and Hart, P. E., Pattern Classification and Scene Analysis,
Wiley, New York, 1973.
[DPH98] Dumais, S., Platt, J., Heckerman, D., and Sahami, M., Inductive Learning
Algorithms and Representations for Text Categorization, Proceedings of the
1998 ACM 7th International Conference on Information and Knowledge
Management (CIKM '98), 1998, pp.148-155.
[EP96] Elder IV, J., and Pregibon, D., A Statistical Perspective on Knowledge
Discovery in Databases, Advances in Knowledge Discovery and Data
Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy
(Eds.), AAAI/MIT Press, 1996, pp. 83-115.
[FW94] Furnkranz, J., and Widmer, G., Incremental Reduced Error Pruning,
Proceedings of the 11th Annual Conference on Machine Learning, New
Brunswick, NJ, 1994.
[G96] Gains, B. R., Transforming Rules and Trees into Comprehensive
Knowledge Structures, Advances in Knowledge Discovery and Data
Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy
(Eds.), AAAI/MIT Press, 1996, pp. 205-228.
[K96] Klosgen, W., Explora: A Multipattern and Multistrategy Discovery
Assistant, Advances in Knowledge Discovery and Data Mining, U. M.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.),
AAAI/MIT Press, 1996, pp. 249-271.
[KR90] Kaufman, L. and Rousseeuw, P. J., Finding Groups in Data: An Introduction
to Cluster Analysis, John Wiley & Sons, Inc., New York, NY, 1990.
[L92a] Lewis, D. D., An Evaluation of Phrasal and Clustered Representations on A
Text Categorization Task, Proceedings of the Fifteenth Annual
International ACM SIGIR Conference on Research and Development in
Information Retrieval, 1992, pp. 37-50.
[L92b] Lewis, D. D., Feature Selection and Feature Extraction for Text
Categorization, Proceedings of the Speech and National Language
Workshop, 1992, pp. 212-217.
[LH98] Lam W., and Ho, C. Y., Using A Generalized Instance set for Automatic
Text categorization; Proceedings of the 21st Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval,
1998, pp. 81-89.
[LSL95] Lu, H., Setiono, R., and Liu, H., NeuroRule: A Connectionist Approach to
Data Mining, Proceedings of 21st International Conference on Very Large
Data Bases, Sept. 1995, pp. 478-489.
[MAR96] Mehta, M., Agrawal, R., and Rissanen, J., SLIQ: A Fast Scalable Classifier
for Data Mining, Proceedings of International Conference on Extending
Database Technology (EDBT 96), Avignon, France, Mar. 1996.
[MMHL86] Michalski, R., Mozetic, I., Hong, J., and Lavrac, N., The Multi-purpose
Incremental Learning System AQ15 and Its Testing Application to Three
Medical Domains, Proceedings of the AAAI-86, Menlo Park, CA, 1986,
pp.1041-1045.
[NGL97] Ng, H. T., Goh, W. B., and Low, K. L., Feature Selection, Perception
Learning, and A Usability Case Study for Text Categorization, Proceedings
of the 20th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 1997, pp. 67-73.
[NH94] Ng, R. and Han, J., Efficient and Effective Clustering Methods for Spatial
Data Mining, Proceedings of International Conference on Very Large Data
Bases, Santiage, Chile, Sept. 1994, pp. 144-155.
[P91] Piatetsky-Shapiro, G., Discovery, Analysis, and Presentation of Strong
Rules, Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W. J.
Frawley (Eds.), AAAI/MIT Press, 1991, pp. 229-238.
[PH90] Pagallo, M., and Haussler, D., 1990, Boolean Feature Discovery in
Empirical Learning, Machine Learning, Vol. 5, No. 1, March 1990, pp.
71-99.
[Q86] Quinlan, J. R., Induction of Decision Trees, Machine Learning, Vol. 1,
1986, pp. 81-106.
[Q90] Quinlan, J. R., MDL and Categorical Theories (continued), Proceedings of
the 12th International Conference on Machine Learning, Lake Tahoe, CA,
1995.
[Q93] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann,
1993.
[RH96] Riloff, E. and Hollaar, L., Text Databases and Information Retrieval, ACM
Computing Surveys, Vol. 28, No. 1, March 1996.
[RK98] Ragas, H. and Koster, C., Four Text Classification Algorithms Compared in
a Dutch Corpus, Proceedings of the 21st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval , 1998,
pp.369-370.
[S80] Spath, H., Cluster Analysis Algorithms: For Data Reduction and
Classification of Objects, John Wiley & Sons, Inc., New York, 1980.
[SHP95] Schutze, H., Hull, D. A., and Pedersen, J. O., A Comparison of Classifiers
and Document Representations for the Routing Problem, Proceedings of
the 18th International ACM SIGIR Conference on Research and
Development in Information Retrieval, 1995, pp. 229-237.
[T99] Tu, H. L., Automatic Categorization of News Using Title Analysis,
Unpublished Master Thesis, National Tsing-Hua University, Taiwan, R.O.C.,
July 1999 (in Chinese).
[V93] Voutilainen, A., "NPtool, a detector of English noun phrases," In
Proceedings of Workshop on Very Large Corpora, Ohio, Jun., 1993.
[WGT90] Weiss, S., Galen, R., and Tadepalli, P., Maximizing the Predictive Value of
Production Rules, Artificial Intelligence, Vol. 45, 1990, pp. 47-71.
[WI93] Weiss, S., Indurkhya, N. Optimized Rule Induction, IEEE Expert, Vol .8,
No. 6, 1993, pp.61-69.
[WK91] Weiss, S. M., and Kulikowski, C. A., Computer Systems that Learn:
Classification and Prediction Methods from Statistics, Neural Nets, Machine
Learning, and Expert Systems, Morgan Kaufman, 1991.
[Y94] Yang, Y., Expert Network: Effective and Efficient Learning from Human
Decisions in Text Categorization and Retrieval, Proceedings of the 17th
International ACM SIGIR Conference on Research and Development in
Information Retrieval, 1994, pp. 13-22.
[YC94] Yang, Y. and Chute, C. G., An Example-Based Mapping Method for Text
Categorization and Retrieval, ACM Transactions on Information Systems,
Vol. 12, No. 3, July 1994, pp. 252-277.
[YPC98] Yang, Y., Pierce, T. and Carbonell, J., A Study on Retrospective and Online
Event Detection, Proceedings of SIGIR 98: 21st Annual International
ACM SIGIR Conference on Research and Development in Information
Retrieval, ACM press, New York, 1998, pp.28-36.
[YCB99] Yang, Y., Carbonell, J. G., Brown, R. D., Pierce, T., Archibald, B. T. and Liu,
X., Learning Approaches for Detecting and Tracking News Events, IEEE
Intelligent Systems and Their Applications, 1999, pp.32-43.
[Z94] Ziarko, W., Rough Sets, Fuzzy Sets and Knowledge Discovery,
Springer-Verlag, 1994.
