J Syst Sci Syst Eng (Dec 2006) 15(4): 474-492 ISSN: 1004-3756 (Paper) 1861-9576 (Online)

DOI: 10.1007/s11518-006-5029-z CN11-2983/N

A MODIFIED ANT-BASED TEXT CLUSTERING ALGORITHM WITH SEMANTIC SIMILARITY MEASURE∗

Haoxiang XIA 1   Shuguang WANG 2   Taketoshi YOSHIDA 3

1 Institute of Systems Engineering, Dalian University of Technology, Dalian, 116024, China
hxxia@dlut.edu.cn
2 BHR-Frontline Technologies (Dalian) Corporation Ltd, Dalian, 116023, China
3 School of Knowledge Science, Japan Advanced Institute of Science and Technology, Ishikawa 923-1292, Japan
yoshida@jaist.ac.jp

Abstract
Ant-based text clustering is a promising technique that has attracted great research attention. This paper attempts to improve the standard ant-based text-clustering algorithm in two dimensions. On one hand, an ontology-based semantic similarity measure is used in conjunction with the traditional vector-space-model-based measure to provide a more accurate assessment of the similarity between documents. On the other hand, the ant behavior model is modified to pursue better algorithmic performance. In particular, the ant movement rule is adjusted so as to direct a laden ant toward a dense area of the same type of items as the ant's carried item, and to direct an unladen ant toward an area that contains an item dissimilar to the surrounding items within its Moore neighborhood. Using WordNet as the base ontology for assessing the semantic similarity between documents, the proposed algorithm is tested on a sample set of documents excerpted from the Reuters-21578 corpus, and the experimental results partly indicate that the proposed algorithm performs better than the standard ant-based text-clustering algorithm and the k-means algorithm.
Keywords: Ant-based clustering, text clustering, ant movement rule, semantic similarity measure

1. Introduction

As an important application field of the data clustering technologies (Jain and Murty et al. 1999), text clustering is the unsupervised partitioning of a collection of textual documents into self-similar groups, so that any item is more similar to another item in the same group than to an item outside the group. Such groups are called clusters, which are formed at run time during the clustering process instead of being pre-defined as in the case of text categorization, which commonly refers to the supervised partitioning of documents into "labeled" (i.e., pre-classified) sets (Sebastiani 2002). With the

∗ This work was supported in part by the National Natural Science Foundation of China under Grants No. 70301009 and No. 70431001, and by the Ministry of Education, Culture, Sports, Science and Technology of Japan under the "Kanazawa Region, Ishikawa High-Tech Sensing Cluster of Knowledge-Based Cluster Creation Project".

© Systems Engineering Society of China & Springer-Verlag 2006



abundance of textual documents, e.g. over the World Wide Web and in corporate document management systems, there is an increasing demand for text mining techniques (Berry 2003). Consequently, as a noteworthy branch of text mining, text clustering has attracted great attention in the last decade, and various data clustering methods have been applied, e.g. agglomerative hierarchical clustering (Ward 1963), k-means (Hartigan and Wong 1979), OPTICS (Ankerst and Breunig et al. 1999), and genetic-algorithm-based clustering (Chiou and Lan 2000). However, despite tremendous endeavors, the performance of the existing methods is often not satisfactory in actual applications; more work needs to be done to develop better text clustering methods.

Essentially, a text clustering method can be regarded as a data clustering method combined with a particular document similarity measure. Therefore, the general data-clustering algorithm and the document similarity measure are the major research subjects of text clustering methods. These two aspects are also the major concerns of the present paper.

From the aspect of the data-clustering algorithm, we notice the recent efforts on the ant-based clustering approach inspired by the ideas of "swarm intelligence" (Beni and Wang 1989, Bonabeau and Dorigo et al. 1999), which refers to the collective intelligent behaviors emerging from the local interactions of groups of less-intelligent agents. An early attempt at an ant-based clustering model was made by Deneubourg and Goss et al. (1991). They modeled the corpse-clustering and larva-sorting activities respectively observed in the ants Pheidole pallidula and Leptothorax unifasciatus. In their model, the ants tend to pick up isolated items (corpses or larvae) and bring them to positions that already contain items of the same type; thus the existing small clusters may attract the ants to deposit more items, and this process finally gives rise to the emergence of large clusters. This work stimulated further research on ant-based clustering methods in the succeeding years. A notable effort was made by Lumer and Faieta (1994), who generalized Deneubourg and Goss et al.'s model and applied it to numerical data analysis. Based on these pioneering contributions, more recent endeavors on ant-based clustering models have also been reported (e.g., Monmarché 1999, Handl and Meyer 2002, Kanade and Hall 2003, Vizine and de Castro et al. 2005); and the proposed ant-based clustering methods have been applied to various areas, such as graph partitioning (Kuntz and Snyers et al. 1998), intrusion detection (Ramos and Abraham 2005), and text clustering (Handl and Meyer 2002, Hoe and Lai et al. 2002, Ramos and Merelo 2002). These efforts reveal that ant-based clustering has become an active research field. Nevertheless, despite all the notable work, the field is still immature, and it remains an open problem to develop more elaborate ant-clustering models with better algorithmic performance.

One focal point of this paper is to improve the algorithmic performance of ant-based clustering by modifying the ant movement rule. Our observation is that in the majority of the existing ant-based clustering methods, the ant movement is supposed to be completely blind; and this blind-walk model could possibly hamper the convergence or at least decrease the


efficiency of the algorithm. To overcome this limitation, we try to establish a mechanism to direct a laden ant toward a dense area of items of the same type as the item being carried by the ant, and to direct an unladen ant to a position containing an item that is dissimilar to its surrounding items, in the hope that such a modified ant movement rule would help the algorithm converge to the appropriate clusters more rapidly. Besides, other modifications of the "standard" ant-based clustering algorithm are also discussed, namely making the "picking" and "dropping" thresholds, as well as each ant's "vision field", adjustable during the clustering process.

The application of the proposed ant clustering method to text clustering, furthermore, leads to a discussion of the associated document similarity measure. Traditionally, the vector space model (VSM) (Salton and Wong et al. 1975) is used to represent documents in text clustering, and correspondingly the similarity between documents is usually assessed by the cosine measure within the word vector space. Such a VSM-based similarity measure has limitations that may influence the performance of clustering algorithms. On one hand, because of the inevitable diversity of the representing words of different documents, the clustering process would take place in a high-dimensional space of word vectors. This tremendously decreases the algorithmic efficiency. Worse still, clustering in a high-dimensional space is difficult because each item tends to have the same distance to all the other items (Beyer and Goldstein et al. 1999). On the other hand, the semantics of each representing word is largely neglected in VSM. Two semantically relevant words are always treated as orthogonal axes in the vector space; as a result, semantically relevant documents that are respectively represented by these two words may be treated as entirely irrelevant. Facing these limitations, a recent trend is to adopt an ontology-based semantic similarity measure in text clustering (e.g. Hotho and Staab et al. 2003, Jing and Zhou et al. 2006). However, so far as we know, the semantic similarity measure is seldom adopted in the existing ant-based text clustering methods. It is therefore worthwhile to examine whether the semantic similarity measure may help improve the performance of ant-based text clustering methods. With the above consideration, the second contribution of this paper is to combine the edge-counting similarity measure (Rada and Mili et al. 1989, Li and Bandar et al. 2003) with the proposed ant-based clustering method to develop a novel text clustering method, and to preliminarily test its feasibility and performance.

The remainder of this paper is structured as follows. In Section 2 we present our modified ant clustering algorithm, based on a brief description and analysis of the standard ant clustering algorithm. Section 3 discusses the semantics-enhanced document similarity measure used in conjunction with the proposed ant clustering algorithm for text clustering. Section 4 discusses a preliminary test to compare the performance of the proposed algorithm with the standard ant clustering algorithm and the k-means algorithm. The paper is finally concluded in Section 5.
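The orthogonality limitation of the VSM can be made concrete with a tiny sketch (the vocabulary and one-hot weights below are illustrative, not taken from the paper): two documents that use different but synonymous index terms receive a cosine similarity of zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy vocabulary: ["car", "automobile", "gold"].
# Two documents that use different but synonymous words.
d1 = [1.0, 0.0, 0.0]   # mentions "car"
d2 = [0.0, 1.0, 0.0]   # mentions "automobile"

# The VSM treats the synonymous terms as orthogonal axes,
# so the semantically related documents score zero.
print(cosine(d1, d2))  # 0.0
```

An ontology-based measure, by contrast, would assign these two documents a non-zero similarity via the short "is-a" path between the two concepts.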


2. Modified Ant Clustering Algorithm

2.1 Standard Ant Clustering Algorithm

Deneubourg and Goss et al.'s (1991) model and the subsequent algorithm proposed by Lumer and Faieta (1994) shaped the basic form of the ant-based clustering methods; the other methods inspired by ants' corpse-piling and larva-sorting activities basically originate from these contributions. Thus we regard Lumer and Faieta's algorithm, which directly stems from Deneubourg and Goss et al.'s original work, as the "standard" ant-based clustering algorithm.

The standard ant-based clustering algorithm first assumes that the items to be clustered are initially laid down at random on a two-dimensional m × m grid, where m depends on the number of items. Each cell in the grid contains at most one item. A collection of artificial "ants" is also placed on the same grid at random. At this initial stage, no ant "carries" any item. After this initialization, a cyclic process is run within which each ant sequentially conducts the following three activities at each step:

1) Picking up: If the ant does not carry any item (i.e. the ant is an "unladen" ant), and if it "encounters" an item oi, the ant decides to pick up or to ignore that item according to a "picking up" probability Pp, which is a function of the local density, i.e. the aggregate similarity of the item oi with its neighboring items. The lower the similarity, the more probably the item is picked up.

2) Moving: After making the "picking up" decision, the ant randomly moves from the current cell to another cell in the grid. In some variations of the Deneubourg-style ant clustering methods, the ant can only move to an adjacent cell that is not occupied by another ant; in other variations, the ant can move across any distance to any other unoccupied cell.

3) Dropping: When the ant reaches a new cell, and if it carries some item (i.e. it is a "laden" ant), the ant must make another decision on whether or not to drop the carried item onto this cell, provided the cell is not occupied by another item. The ant's decision depends on another probability (the dropping probability Pd), which is a function of the similarity between the carried item and the items neighboring the newly-arrived cell. The more similar items exist in the local area around the cell, the more likely the ant drops the item.

Repeating such activities, the ants may gradually divide different types of items into different clusters. The overall process ends when the clusters become stable or a given maximal iteration count has been reached.

Obviously, the key factors of the above ant clustering algorithm are the picking-up and dropping probability functions Pp and Pd. In Deneubourg and Goss et al.'s model, these two functions are determined by the following equations:

Picking up probability:  Pp = (k1 / (k1 + f))^2    (1)

Dropping probability:    Pd = (f / (k2 + f))^2    (2)


where f = f(oi) is the aggregate similarity of the item oi in its neighborhood, while k1 and k2 are threshold constants (the picking threshold and the dropping threshold, respectively).

The above ant clustering algorithm has shown good performance in some test cases (Handl and Knowles et al. 2003). However, in our own experience this algorithm does not efficiently converge to a proper solution in various situations. Our critiques of the standard ant clustering algorithm are threefold:

• In the standard algorithm, the ant movement is completely random, i.e., at each step each ant blindly moves toward a new position. Following this rule, many ants may do nothing helpful for clustering at many steps, which hampers the algorithmic efficiency. In worse cases, the ants may even fail to form the proper clusters within an affordable computation time.

• The ants' picking-up and dropping activities are solely governed by two fixed probability parameters (thresholds). However, it is often difficult to find a configuration of these two thresholds with which the ants effectively find a good solution.

• The range in which the aggregate similarity of the documents is assessed, i.e. the ants' "vision field", is another important parameter that influences the algorithmic performance. In the standard algorithm, this range is also predefined and fixed. This fixed value may sometimes cause inappropriate behaviors in the actual clustering process as well. If an inappropriately small value is assigned to the range, it is often the case that large clusters do not appear. Conversely, if a big value is assigned to the range, the algorithm may not be able to eliminate dissimilar items from a formed cluster.

The limitations of the fixed parameter values in the standard ant clustering algorithm have partly been addressed by a few other researchers. For example, Vizine and de Castro et al. (2005) suggested a progressive ant vision dependent on the aggregate similarity function (i.e. the function f in Equation (1)). The same work also proposed a "cooling schedule" for the run-time adjustment of the picking threshold, namely to geometrically decrease the picking threshold to 98 percent of its value every 10000 ant steps. However, not much work on the Deneubourg-style ant clustering methods has been presented to address the blind ant movement problem. Chen and Xu et al.'s (2004) critique of this problem was similar to ours; but their solution, which differs from ours, was to develop a brand new clustering model upon the idea of gregarious ant colonies instead of revising the Deneubourg model.

2.2 Modified Ant Clustering Algorithm

Based on the above description and discussion, we propose a modified ant-based clustering algorithm that tries to overcome the aforementioned limitations of the standard ant clustering algorithm. Our modifications of the standard ant clustering algorithm concern three aspects: the modified ant movement rule, the modified setting of the ant vision, and the modified setting of the picking and dropping thresholds.
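For reference, the standard picking and dropping probabilities of Equations (1) and (2), which the modifications below adjust at run time, can be sketched as follows; the threshold values and aggregate-similarity inputs in the usage lines are illustrative, not the paper's settings.

```python
def pick_probability(f, k1):
    """Picking probability of Equation (1): Pp = (k1 / (k1 + f))^2.
    A low aggregate similarity f makes picking the item up more likely."""
    return (k1 / (k1 + f)) ** 2

def drop_probability(f, k2):
    """Dropping probability of Equation (2): Pd = (f / (k2 + f))^2.
    A high aggregate similarity f makes dropping the item more likely."""
    return (f / (k2 + f)) ** 2

# Illustrative thresholds; Section 2.2.3 adjusts them at run time.
k1, k2 = 0.1, 0.15
print(pick_probability(0.05, k1))  # isolated item -> high picking probability
print(drop_probability(0.90, k2))  # dense similar area -> high dropping probability
```

Both functions map the aggregate similarity f onto [0, 1], so each ant can make its decision by comparing the value against a uniform random draw.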


2.2.1 Modified Ant Movement Rule

As previously discussed, in the standard algorithm each ant arbitrarily moves from one cell to another. Such ant movement may waste great computational resources, as it can be estimated that, at many steps, some ants take void actions that are useless for clustering. It is then natural to expect that alleviating the ants' void actions would improve the algorithmic efficiency. The basic heuristic is to redesign the ant movement rule so as to direct a laden ant to bring the carried item to a dense area of the same type of items, and to direct an unladen ant to an area that contains an item dissimilar to its surrounding items.

To do so, we assume that each ant is able to access a "map" which stores all documents' positions in the whole grid. This "map" is called a "global cache" of the document positions. With this global cache, a laden ant can theoretically estimate which area in the grid has the maximal aggregate similarity for the ant's currently-carried item, and the ant may then be directed toward that area. However, to get the exact position with the maximal aggregate similarity, each laden ant would have to calculate the aggregate similarity at each position within the entire grid at each step. This would impose a great computational load if the grid is large, which is not desirable. Therefore, we need to make a trade-off between the accuracy of locating the positions with the maximal aggregate similarity and the affordability of the computational load. Our solution is to split the entire grid into a few sub-grids and to randomly select one cell per sub-grid at which to calculate the aggregate similarity. The number of sub-grids depends on the estimated number of clusters and on the overall scale of the items to be clustered. What's more, to avoid the situation in which an ant picks up an item and deposits it at another position where the aggregate similarity is even lower than at the original position, the aggregate similarity at the current position is also calculated. The calculated similarities are then compared with each other, and the ant moves to the position with the maximal aggregate similarity. Algorithmically, the movement rule of a laden ant is depicted in Figure 1.

  1. Calculate the aggregate similarity of the carried item o at the current cell;
  2. Split the grid into l*l sub-grids, and randomly select an empty cell from each sub-grid;
  3. Calculate the aggregate similarity of the item o at each of the selected cells;
  4. Select the cell that has the maximal aggregate similarity among the randomly-selected cells and the current cell (i.e. the original cell);
  5. Move the ant with the carried item to the selected cell.

Figure 1 Movement rule of a laden ant

To note, in our approach the aggregate similarity of a particular item o at a particular position is computed as the average similarity between the item and the surrounding items within the r-ranged Moore neighborhood of the item, that is:

f(o) = ( Σj sim(o, oj) ) / N(o)    (3)

where f(o) is the aggregate similarity of the item o in its Moore neighborhood; oj is a neighboring item of o, and sim(o, oj) is the similarity measure


between o and oj; finally, N(o) is the number of cells in the Moore neighborhood, which depends on the range r of the neighborhood, i.e. N(o) = r^2.

In contrast to the laden ants, which tend to move toward an area with high aggregate similarity, an unladen ant tends to move toward an area that contains an item dissimilar to its surrounding items. Based on this consideration, the movement rule of an unladen ant is described in Figure 2.

  1. Split the grid into l*l sub-grids, and randomly select a cell that is occupied by some item from each sub-grid;
  2. Calculate the aggregate similarities of the items that occupy the selected cells within their Moore neighborhoods;
  3. Find the cell that has the minimal aggregate similarity;
  4. Move the unladen ant to the found cell.

Figure 2 Movement rule of an unladen ant

2.2.2 Run-time Adjustment of Ant Vision

We are also concerned with the range of the Moore neighborhood in which the ants assess the aggregate similarity of documents. This range of the Moore neighborhood can be regarded as the ants' "vision". Sharing Vizine and de Castro et al.'s (2005) view, we believe that a fixed value of the ants' vision is inappropriate for clustering. However, we suggest a different solution for dynamically adjusting the ant vision, which is based on the observation that the overall clustering process can be divided into different stages. As also analyzed by Handl and Knowles et al. (2003), the overall clustering process can roughly be divided into three stages. At the initial stage, some small clusters appear; at the intermediate stage, small clusters tend to merge into larger clusters; and at the final stage, when the large clusters become gradually stable, the phenomenon of grouping of small clusters re-appears. In tune with this observation, we use an adjustable range of the Moore neighborhood. During the first one-third of the iterations of the overall process, a relatively small range r is set. Then, during the second one-third of the iterations, r is raised by Δr. During the last one-third of the iterations, r is decreased to the original value again.

2.2.3 Run-time Adjustment of Picking and Dropping Thresholds

We argue that adjustable "picking up" and "dropping" probability parameters are also helpful for improving the algorithmic performance. For the dropping probability, we assign an independent dropping threshold to each ant (initially all ants' dropping thresholds are set equal), instead of using a uniform threshold for all ants. If an ant has failed to drop its carried item for several steps, it decreases its dropping threshold in order to increase the dropping probability, i.e., decreasing k2 to k2 − Δk until k2 = 0 (at which point the item can definitely be dropped). When the item is finally dropped, the threshold is re-assigned to the original value.

For the adjustment of the picking threshold, the basic observation is that the clustering solution generated by the standard ant-based clustering technique is not very stable in quite some situations, especially when a relatively high picking threshold is set. If the threshold is inappropriately high, the ants tend to


pick items up from well-established clusters and thereby destroy the clusters. As a result, the clusters may cyclically be constructed and destroyed, instead of converging into a stable solution. To overcome this, we follow Vizine and de Castro et al.'s (2005) idea of a "cooling schedule" to decrease the picking threshold gradually once it is estimated that "good" clusters have formed. But our setting is slightly different from theirs: we decrease the picking threshold (k1 in Equation (1)) to 90.0% of its original level every 5,000 steps, unless the threshold has reached the minimal value 0.001.

2.2.4 Overall Ant Clustering Process

Based on the prior descriptions of the modifications of the standard ant clustering algorithm, the overall process of the proposed algorithm is described in Figure 3.

  1 Initialize the grid and the ants;
  2 Calculate the similarities between items and form the similarity matrix;
  3 Start the iteration of ant behaviors:
    3.1 For each ant, conduct the following activities in sequence:
      3.1.1 Make the picking-up decision and adjust the picking threshold every 5,000 ant steps;
      3.1.2 Make the ant movement decision;
      3.1.3 Make the dropping decision and adjust the dropping threshold;
    3.2 Adjust the ant vision according to the specified stages;
  4 End the iteration after the pre-defined number of steps.

Figure 3 The overall process of ant clustering

3. Semantics-Enhanced Similarity Measure for Text Clustering

The previous section proposed a modified ant clustering algorithm. In this section we further discuss the application of the proposed algorithm to text clustering. As the performance of any clustering algorithm highly relies on the similarity or distance measure between items, we are concerned with the similarity measure used in text clustering. In particular, we use a semantic similarity measure in text clustering, attempting to overcome the limitations of the cosine measure based on the bag-of-words (i.e. VSM-based) representation of documents (Baeza-Yates and Ribeiro-Neto 1999, pp. 27-28), which is widely used by the existing ant-based text clustering methods.

In general, semantic document similarity measures can be categorized into two schools, namely the edge-counting-based measures and the information-content-based measures. One cornerstone of the edge-counting-based measures is Rada and Mili et al.'s (1989) work, which proved that the minimal number of edges separating two concepts within a lexical taxonomy of "is-a" links could serve as a metric for measuring the conceptual distance of these two concepts. Following this idea, various edge-counting methods for document similarity measurement have been proposed. In contrast, following the contributions of Resnik (1995) and Lin (1998), the basic idea behind the information-content-based methods is to define the similarity between two concepts as the maximum of the information content of the concept that subsumes them in the taxonomic hierarchy, which can further be calculated by the information-theoretic entropy of the subsumer


(i.e. the closest common ancestor of the two concepts) within a corpus. As discussed by various researchers (e.g. Li and Bandar et al. 2003), both categories of methods have their advantages and limitations, and this research direction is still under rapid development.

In this paper, our aim is to apply an appropriate semantic similarity measure to enhance ant-based text clustering. At the current stage, we choose the edge-counting approach for the semantic similarity measure, practically due to the lack of a well-formed corpus, which is necessary for the information-content-based approach, for testing our algorithm. With a few modifications and simplifications (in order to decrease the computational complexity) of the method given by Li and Bandar et al. (2003), the similarity measure adopted in this paper is described as follows.

The proposed semantic similarity measure is based on a lexical hierarchy or ontology that is comprised of concepts inter-connected through hyponymy ("is-a") links. We take into account two factors for calculating the similarity between two concepts in the ontology: the path length between the two concepts, and the depth of their subsumer in the hierarchy. That means, if the two concepts whose similarity we are going to calculate are c1 and c2, then the similarity is a function denoted by Equation (4):

sim(c1, c2) = g1(l) · g2(h)    (4)

where l is the shortest path connecting c1 and c2, and h is the depth of the subsumer of c1 and c2. In Equation (4) it is assumed that the impacts of the parameters l and h on the similarity are independent from one another, so that the similarity function is comprised of the two independent functions g1 and g2.

Apparently, the similarity between two concepts monotonically decreases as the path connecting them in the ontology lengthens. Furthermore, it is reasonable to expect the similarity to decrease at an exponential rate; therefore g1 is defined by Equation (5):

g1(l) = e^(−αl)    (5)

where α is a real constant between 0 and 1.

The depth of the subsumer is derived by counting the shortest length of links from the subsumer to the root concept of the ontology. The intuitive observation of the impact of the depth on the similarity measure is that concepts at higher levels of the lexical hierarchy carry less semantic information content, and therefore two adjacent concepts at a higher level should have less semantic similarity compared with two adjacent concepts at a lower level. For example, consider a fraction of an ontology about animal classification. Within this ontology, "canine" and "feline" both belong to "carnivore", while the two lower-level concepts "leopard" and "tiger" both belong to the animal category "feline". In this case, people would usually consider the similarity between "tiger" and "leopard" to be higher than that between "canine" and "feline". In terms of this observation, a monotonically increasing function of the depth h can then be defined:

g2(h) = (e^(βh) − e^(−βh)) / (e^(βh) + e^(−βh))    (6)

where β > 0 is a constant.

So far we have obtained the semantic similarity between two concepts. The semantic similarity between two documents can accordingly be computed. We use a set of


"concepts" to form a semantic representation of a document. Assume document d1 is represented by concepts (c1,1, c1,2, …, c1,m), and document d2 is represented by concepts (c2,1, c2,2, …, c2,n); the semantic similarity between these two documents is then assessed by the following function:

sims(d1, d2) = ( Σi=1..m Σj=1..n sim(c1,i, c2,j) ) / (mn)    (7)

The above model serves as the basic metric to assess the similarity between documents in our text clustering algorithm. However, if solely using this semantic measure, the algorithm may encounter difficulties in real-life applications, due to the uncertain quality of the available lexical taxonomies, which may not accurately portray all the needed concepts and their relations. Therefore, we practically use a hybrid model that combines the ontology-based and VSM-based measures, so that the similarity of two documents can still be partially assessed by the VSM-based measure even in the case that the ontology-based measure does not work well. The suggested hybrid model is formulated in Equation (8):

sim(d1, d2) = λ·simw(d1, d2) + (1 − λ)·sims(d1, d2)    (8)

Equation (8) denotes that the similarity between two documents is the weighted sum of the VSM-based similarity (i.e., simw(d1, d2)) and the ontology-based semantic similarity (i.e., sims(d1, d2)). The weight λ depends on the quality of the ontology being used.

The VSM-based similarity measure used in Equation (8) is the standard cosine measure. Here we outline the basic formula to calculate the VSM-based similarity. In this similarity measure, a document is represented by a collection of words or index terms with corresponding weights, which can usually be calculated by the TFIDF (term frequency / inverse document frequency) weighting scheme (Salton and Fox et al. 1983). For a series of documents, all the representing words form a vector space, and each document can be represented as a vector within this space:

di = (w1, w2, …, wn)    (9)

where n is the dimension of the vector space, and w1, w2, …, wn are the weights of the index terms; wk = 0 if the k-th index term in the vector space is not used to represent the specific document. Consequently, the cosine measure of the similarity between two documents is defined by Equation (10):

simw(di, dj) = (di · dj) / (|di| × |dj|) = ( Σt=1..n wi,t·wj,t ) / ( sqrt(Σt=1..n wi,t^2) · sqrt(Σt=1..n wj,t^2) )    (10)

4. Evaluation Experiments

In this section the test results of the proposed algorithm are reported. Although our current test is still primitive, with a relatively small dataset being used, the positive results partially indicate the benefits of the proposed algorithm.

In the current test, 50 documents are arbitrarily excerpted from the Reuters-21578 corpus, which is one of the most widely adopted benchmarking datasets in the text mining field (Lewis 2006). The selected documents cover the 5 topics "gas", "gold", "livestock", "grain", and "money", with each topic containing 10 documents. The keywords (concepts) to


represent these documents are extracted by using the text mining tool, TextAnalyst™ (Megaputer Intelligence Inc. 2006). Furthermore, WordNet® (Miller 1995) is used as the base ontology to analyze the semantic similarity between concepts, as well as between documents.

With these resources and tools, we test our algorithms in two steps: 1) testing the feasibility and performance of using the ontology-based semantic similarity measure; and 2) testing the performance of our revised algorithm, comparing with the standard ant-based text clustering algorithm.

4.1 Test of Semantic Similarity Measure

To calculate the semantic similarity between the selected documents, in the current test we directly treat the TextAnalyst-generated terms as "concepts" and calculate the semantic similarity between two documents according to the conceptual distance of the representing concepts in WordNet. Setting the parameters as shown in Table 1, sample comparison results of pairs of documents are illustrated in Tables 2 and 3, respectively.

Table 1 Parameter setting for similarity measure

Parameter      Description of parameter                      Value
α              As described in Equation (5)                  0.08
β              As described in Equation (6)                  0.6
λ              As described in Equation (8)                  0.2
upper-length   If the path length between two concepts is    12
               larger than this given number, their
               similarity is set to 0

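The concept-level measure governed by the parameters in Table 1 depends on Equations (5) and (6), which are defined in an earlier section of the paper and are not reproduced in this excerpt. As an illustration only, the sketch below assumes the functional form of Li, Bandar & McLean (2003) (reference [21]): similarity decays exponentially with the WordNet path length l between two concepts and saturates with the depth h of their lowest common subsumer. The α, β, and upper-length values are those of Table 1; the functional form itself is an assumption.

```python
import math

ALPHA = 0.08        # path-length coefficient (Table 1, Equation (5))
BETA = 0.6          # depth coefficient (Table 1, Equation (6))
UPPER_LENGTH = 12   # similarity is forced to 0 beyond this path length (Table 1)

def concept_similarity(path_len: int, subsumer_depth: int) -> float:
    """Li/Bandar/McLean-style concept similarity from WordNet topology.

    path_len: number of edges between the two concepts in WordNet.
    subsumer_depth: depth of their lowest common hypernym.
    (This functional form is an assumption; Equations (5)-(6) of the
    paper are not part of this excerpt.)
    """
    if path_len > UPPER_LENGTH:          # Table 1 cutoff
        return 0.0
    f_length = math.exp(-ALPHA * path_len)   # decays with conceptual distance
    f_depth = math.tanh(BETA * subsumer_depth)  # saturates with subsumer depth
    return f_length * f_depth
```

Under this form, identical concepts with a deep common subsumer score close to 1, while concepts more than 12 edges apart score exactly 0, matching the cutoff behavior described in Table 1.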
Table 2 Partial results of the similarity measure for homogeneous documents

            VSM-based Cosine Measure                 Ontology-based Measure
Doc #   Gas1    Gas2    Gas3    Gas4    Gas5     Gas1     Gas2     Gas3     Gas4     Gas5
Gas1    1.0     0.0     0.0     0.0     0.0      1.0      0.79604  0.59636  0.79604  0.79604
Gas2    0.0     1.0     0.0     0.8858  0.1609   0.79604  1.0      0.59636  0.97716  0.83020
Gas3    0.0     0.0     1.0     0.0     0.5454   0.59636  0.59636  1.0      0.59636  0.70644
Gas4    0.0     0.8858  0.0     1.0     0.1721   0.79604  0.97716  0.59636  1.0      0.83245
Gas5    0.0     0.1609  0.5454  0.1721  1.0      0.79604  0.83020  0.70644  0.83245  1.0

Table 3 Partial results of the similarity measure for heterogeneous documents

             VSM-based Cosine Measure              Ontology-based Measure
Doc #        Gas1  Gas2  Gas3  Gas4  Gas10     Gas1  Gas2  Gas3     Gas4     Gas10
Livestock1   0.0   0.0   0.0   0.0   0.0       0.0   0.0   0.480    0.487    0.0
Livestock2   0.0   0.0   0.0   0.0   0.0       0.0   0.0   0.48988  0.48767  0.0
Livestock3   0.0   0.0   0.0   0.0   0.0       0.0   0.0   0.64434  0.42996  0.0
Livestock4   0.0   0.0   0.0   0.0   0.0       0.0   0.0   0.48988  0.48887  0.0
Livestock10  0.0   0.0   0.0   0.0   0.0069    0.0   0.0   0.525    0.56290  0.00138
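The two halves of Tables 2 and 3 correspond to the two components that Equation (8) interpolates. A minimal sketch of that combination, assuming documents are represented as sparse term-weight dictionaries and that the ontology-based score sim_s has already been computed (e.g., from WordNet concept comparisons), with λ = 0.2 as in Table 1:

```python
import math

LAMBDA = 0.2  # weight of the VSM-based measure in Equation (8) (Table 1)

def cosine_similarity(di: dict, dj: dict) -> float:
    """Equation (10): cosine of two term-weight vectors (dicts term -> weight)."""
    dot = sum(w * dj.get(t, 0.0) for t, w in di.items())
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)

def hybrid_similarity(di: dict, dj: dict, semantic_sim: float) -> float:
    """Equation (8): lambda * sim_w + (1 - lambda) * sim_s.

    semantic_sim is the ontology-based score sim_s(di, dj); how it is
    aggregated from concept-level similarities is described elsewhere
    in the paper and is taken as given here.
    """
    return LAMBDA * cosine_similarity(di, dj) + (1.0 - LAMBDA) * semantic_sim
```

Two documents that share no index terms get a cosine score of 0, so the hybrid score falls back to 0.8 times the semantic score, which is exactly the situation the left and right halves of Table 2 contrast.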


Table 2 compares the similarities between pairs of a few homogeneous documents in the "gas" category. The left part of the table illustrates the calculation results by the VSM-based cosine measure, and the right part illustrates the results by the ontology-based semantic similarity measure. The results show that the ontology-based measure performs significantly better than the cosine measure, in the sense that the latter approach fails to give an appropriate assessment of the similarity between two semantically relevant documents represented by different terms, whereas the former approach handles such situations well.

Table 3 illustrates the calculation results of the similarities between heterogeneous documents. We select five documents in the category "livestock" and five documents in the category "gas". The similarities between most pairs of documents of different categories are zero by the cosine measure; in our calculation, only the cosine similarity between "livestock10" and "gas10" is non-zero. By the ontology-based similarity measure, on the contrary, the similarities between a few pairs of heterogeneous documents are not significantly smaller than the similarities between pairs of homogeneous documents, although this ontology-based measure on average performs well in distinguishing the different categories of documents. As shown in Table 3, the ontology-based similarities between a "livestock" document and the "gas" documents "gas1", "gas2" and "gas10" are zero or relatively small; however, the similarities between a "livestock" document and the documents "gas3" and "gas4" are basically around 0.5, which is not perfectly distinct from the similarities between a pair of homogeneous documents. As shown in Table 2, the similarity of a pair of "gas" documents may be less than 0.6 (e.g., the similarity between the documents "gas2" and "gas3").

The possible reasons for this phenomenon are twofold. On one hand, the Reuters-21578 collection is essentially comprised of economics-related documents; two documents may factually have semantic relations with one another although being artificially classified into two categories. On the other hand, the ontology used, WordNet, is a general-purpose lexical database, or a shallow ontology, which might not accurately describe the sophisticated semantic relations between documents in specialized domains. In spite of these reasons, the results in Table 3 show that the ontology-based measure may be limited in distinguishing dissimilar documents from similar ones. What's more, to use the ontology-based measure, the coefficients in Equations (5) and (6) must be set properly. For example, if the value of α is too small, the effect of conceptual distance on the similarity would be weakened, and the semantically similar documents would be blurred with the dissimilar ones.

To sum up, the experiment illustrates that the proposed ontology-based measure generally increases the accuracy of the similarity measure. Especially, the ontology-based measure shows great advantages over the traditional cosine measure in assessing semantically similar documents. However, in practice, there is the danger that similar documents may not be precisely distinguished from dissimilar ones, if the coefficients α and β are not properly


set, or the ontology used is too shallow or too narrow. To partly alleviate this, in our ant-based text clustering algorithm we use a hybrid model of ontology-based and VSM-based measures, instead of a "pure" semantic measure, as currently none of the existing ontologies would completely fit the needs of all text mining problems.

4.2 Performance Analysis of the Modified Ant Algorithm

We test the performance of the proposed clustering algorithm with the aforementioned dataset. The basic parameters of the algorithm are set as shown in Table 4.

Table 4 Parameter setting for the ant algorithm

Parameter                                Value
Grid size                                15×15
Count of ants                            12
k1: initial picking threshold            0.35
k2: initial dropping threshold           0.35
∆k: variation of the picking threshold   0.02
r: initial Moore neighborhood radius     3
∆r: the neighborhood increment           4

With the above setting, the proposed algorithm is tested. Figures 4 and 5 respectively illustrate the initial distribution and the distribution after 10,000 iterations of the given documents in an execution of the proposed algorithm.

For the convenience of observation, in Figures 4 and 5 we distinguish the different types of documents ("Livestock", "Money", "Gas", "Grain", and "Gold") by different filling patterns of the cells, which are explained in the legend part of Figure 4. As shown in Figure 5, four clusters are generated by our algorithm: the first cluster contains the documents about "money"; the second contains the documents about "gold" and "livestock"; the third basically contains documents about "grain"; and the fourth contains documents about "gas".

Figure 4 Initial distribution of the documents (legend: Livestock, Money, Gas, Grain, Gold)

Figure 5 Distribution after 10,000 iterations (clusters labeled 1-4)

This clustering result basically reflects the predefined categories, except that the documents about "grain" and


"gas" are within the same cluster (i.e., in cluster 2).

In order to numerically evaluate the performance of the proposed algorithm, we furthermore calculate the precision and recall of the clustering result. We define the precision of the cluster j with respect to the type i (predefined for the purpose of testing) as:

precision(i, j) = N_ij / N_j    (11)

and the recall as:

recall(i, j) = N_ij / N_i    (12)

where N_ij refers to the number of documents in the cluster j that belong to the type i; N_i refers to the number of documents in the type i; and N_j is the number of documents in the cluster j. With these two definitions, the precision and recall of the different categories of documents in our experiment can be calculated.

The clustering result shown in Figure 5 is the result of a typical execution of our proposed algorithm on the test dataset. Actually, we run the experiment on the same dataset multiple times, obtaining somewhat different clustering results every time (as the proposed clustering algorithm is in nature a nondeterministic approach). The average recall and precision of the repeated experiments are shown in Table 5.

Table 5 Recall and precision of the test example

Type        Recall    Precision
gas         0.71429   0.75000
gold        0.90000   0.82143
grain       0.88571   0.66000
livestock   1.00000   0.52336
money       1.00000   0.45220
Average     0.89999   0.64140

The recall and the precision of the proposed algorithm are then compared with those of the standard ant clustering algorithm. As we suggest modifying the standard ant-based text clustering in two dimensions, namely the modification of the similarity measure and the modification of the ant clustering process, we hope to evaluate the effects of both modifications. To do so, we test four combinations of the similarity measure and the ant clustering process: standard ant clustering with the VSM-based similarity measure (SAVS), standard ant clustering with the ontology-based similarity measure (SAOS), modified ant clustering with the VSM-based similarity measure (MAVS), and modified ant clustering with the ontology-based similarity measure (MAOS). The recall and precision of each method are shown in Tables 6 and 7, respectively.

Table 6 Recall comparison of the four methods

            SAVS      SAOS      MAVS      MAOS
gas         0.36667   0.76667   0.30000   0.71429
gold        0.30000   0.93333   0.39000   0.90000
grain       0.63333   0.85000   0.37000   0.88571
livestock   0.26667   0.96667   0.28000   1.00000
money       0.36667   0.71667   0.36000   1.00000
Average     0.38667   0.84667   0.34000   0.89999

Table 7 Precision comparison of the four methods

            SAVS      SAOS      MAVS      MAOS
gas         0.39000   0.33833   0.43000   0.75000
gold        0.44000   0.72667   0.45000   0.82143
grain       0.41000   0.32500   0.42000   0.66000
livestock   0.32667   0.49833   0.32000   0.52336
money       0.44667   0.34833   0.45000   0.45222
Average     0.40267   0.44733   0.41400   0.64140
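Equations (11) and (12) can be computed directly from the cluster assignment. The sketch below assumes the predefined types and the resulting clusters are given as dictionaries keyed by document identifier; the function and variable names are illustrative, not from the paper:

```python
def precision_recall(types: dict, clusters: dict, type_i, cluster_j):
    """Equations (11)-(12) for one (type, cluster) pair.

    types[d] is the predefined type of document d; clusters[d] is the
    cluster it was assigned to.  N_ij counts documents of type_i that
    landed in cluster_j; N_j is the cluster size; N_i is the type size.
    """
    n_ij = sum(1 for d in types
               if types[d] == type_i and clusters.get(d) == cluster_j)
    n_j = sum(1 for d in clusters if clusters[d] == cluster_j)
    n_i = sum(1 for d in types if types[d] == type_i)
    precision = n_ij / n_j if n_j else 0.0   # Equation (11)
    recall = n_ij / n_i if n_i else 0.0      # Equation (12)
    return precision, recall
```

For example, with types {"d1": "gas", "d2": "gas", "d3": "gold"} and clusters {"d1": 1, "d2": 2, "d3": 1}, precision_recall(types, clusters, "gas", 1) returns (0.5, 0.5): one of the two documents in cluster 1 is a "gas" document, and one of the two "gas" documents landed in cluster 1.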


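The modified pick/drop behavior evaluated in this section is specified in Section 3 of the paper, which is not part of this excerpt. For orientation only, the sketch below uses the classic Deneubourg/Lumer-Faieta probability shapes together with the k1 and k2 thresholds of Table 4; the paper's modifications (threshold adaptation by ∆k, neighborhood growth by ∆r, and the biased movement rule for laden ants) are noted in comments rather than implemented.

```python
# k1 and k2 are the initial picking/dropping thresholds of Table 4.
# The probability shapes below are the standard Deneubourg/Lumer-Faieta
# forms -- an assumption, since the paper's modified rules (Section 3)
# are outside this excerpt.
K1 = 0.35  # picking threshold (adapted by delta-k = 0.02 in the paper)
K2 = 0.35  # dropping threshold

def neighbourhood_similarity(item, neighbours, sim) -> float:
    """Average similarity f between `item` and the items currently inside
    its Moore neighborhood (initial radius r = 3 in Table 4, enlarged by
    delta-r in the paper); an empty neighborhood yields f = 0."""
    if not neighbours:
        return 0.0
    return sum(sim(item, other) for other in neighbours) / len(neighbours)

def pick_probability(f: float) -> float:
    """An unladen ant picks up an isolated item (low f) with high
    probability."""
    return (K1 / (K1 + f)) ** 2

def drop_probability(f: float) -> float:
    """A laden ant drops its item where the neighborhood is similar
    (high f).  The paper's modified movement rule additionally steers
    the laden ant toward such dense, similar areas before this decision
    is made."""
    return (f / (K2 + f)) ** 2
```

With these shapes, an item sitting in an empty or dissimilar neighborhood is almost certainly picked up and almost never dropped, which is the feedback loop that lets clusters of similar documents grow on the grid.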
The comparisons in Tables 6 and 7 show that our modifications in both dimensions contribute to improving the clustering performance. Especially, the contribution of the ontology-based similarity measure is significant in improving both the recall and the precision. The contribution of the modified ant clustering process lies mainly in the improvement of the clustering precision, whereas its effect on the recall is not so significant. This result basically fits our anticipation when designing the algorithm: the main purpose of the modification of the ant clustering process is to increase the convergence speed of the algorithm, and we expect the improvement of the recall and the precision to come basically from the proposed hybrid similarity measure. Our experiments show that the modified ant clustering process has an advantage in algorithmic efficiency over the standard ant-clustering process. Using our ant-clustering algorithm, good clusters are formed and stabilized after about 5,000 ant steps, whilst the standard ant clustering algorithm usually reaches convergence after 20,000 ant steps.

We also compare the proposed algorithm with the k-means method, which can be regarded as today's benchmarking clustering technique. The recall and precision comparisons between our algorithm and the k-means algorithm are illustrated in Table 8. The results in Table 8 show that the clustering precision of the proposed algorithm is basically at the same level as that of the k-means algorithm; however, in terms of the recall, the proposed ant clustering algorithm performs apparently better than the k-means algorithm. In summary, the comparison results indicate that the proposed algorithm, at least in our test case, has better performance than the standard ant clustering algorithm and the k-means algorithm.

Table 8 Comparison with the k-means method

            Recall              Precision
            MAOS      k-means   MAOS      k-means
gas         0.71429   0.42000   0.75000   0.67920
gold        0.90000   0.32000   0.82143   0.63800
grain       0.88571   0.36000   0.66000   0.59000
livestock   1.00000   0.54000   0.52336   0.56500
money       1.00000   0.62000   0.45222   0.80000
Average     0.89999   0.45200   0.64140   0.65444

5. Conclusions

In this paper, we present a modified ant-based clustering algorithm and apply it to the field of text clustering with the help of an ontology-based semantic similarity measure between documents. The proposed algorithm is a revised version of an algorithm the authors proposed earlier (Xia and Wang et al. 2006), trying to increase its scalability to cater for larger datasets. Our experiments on the proposed algorithm show that, on one hand, the proposed algorithm efficiently converges to reasonably good clusters, compared with the standard ant-based text clustering algorithm; on the other hand, the recall and precision of the clustering results of the proposed algorithm are higher than those of the standard ant-based text clustering algorithm and the k-means algorithm. These results partly indicate that it may be worthwhile to investigate the proposed algorithm further.

Currently, the authors are still on the way to test the proposed algorithm and to further improve it. Three issues are of our


central concern. First, we are continuing to evaluate the usefulness and performance of the proposed algorithm with more datasets. Especially, as we observe in the experiments discussed in the present paper, the WordNet ontology may not be precise enough to assess the semantic similarities between Reuters-21578 texts, and this may possibly influence the performance of the proposed algorithm. We are considering testing our algorithm with a more elaborate ontology together with a text collection that can be more accurately described by that ontology. A candidate pair of text collection and ontology is a text collection excerpted from the Medline database (Lindberg and Siegel et al. 1993) and the Galen ontology (Rector and Gangemi et al. 1994). The second concern of our future work is to test the use of corpus-based or information-content-based similarity measures in our algorithm, to explore the possible improvement of the overall algorithmic performance. Finally, in the present work we basically concern ourselves with the Deneubourg-style ant clustering model; extensive evaluations and comparisons with other ant-behavior-inspired clustering models, e.g., Labroché and Monmarché et al.'s (2002) algorithm based on the chemical recognition system of ants, and Chen and Xu et al.'s (2004) algorithm based on the ant sleeping model, are also an interesting direction for future work.

Acknowledgments

The authors are grateful to Prof. Zhongtuo Wang for helpful comments on this paper. The kind recommendations from the anonymous referees are warmly appreciated. Finally, we would like to thank the Journal's managing editor, Prof. Jian Chen, for his hard work and kind help.

References
[1] Ankerst, M., Breunig, M., Kriegel, H.P. & Sander, J. (1999). OPTICS: ordering points to identify clustering structure. In: Proceedings of the ACM SIGMOD Conference, pp. 49-60
[2] Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley, Boston, MA, USA
[3] Beni, G. & Wang, U. (1989). Swarm intelligence in cellular robotic systems. In: Proceedings of the NATO Advanced Workshop on Robots and Biological Systems, Tuscany, Italy, 1989
[4] Berry, M. (ed.) (2003). Survey of Text Mining: Clustering, Classification, and Retrieval. Springer, New York
[5] Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). When is "nearest neighbor" meaningful? In: Proceedings of the 7th International Conference on Database Theory (ICDT99), LNCS 1540, pp. 217-235. Springer, Berlin
[6] Bonabeau, E., Dorigo, M. & Theraulaz, G. (1999). Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, New York
[7] Chen, L., Xu, X. & Chen, Y. (2004). An adaptive ant colony clustering algorithm. In: Proceedings of the Third International Conference on Machine Learning and Cybernetics (ICMLC04), pp. 1387-1392
[8] Chiou, Y-C. & Lan, L.W. (2000). Genetic clustering algorithms. European Journal of


Operational Research, 135: 413-427
[9] Deneubourg, J.L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C. & Chrétien, L. (1991). The dynamics of collective sorting: robot-like ants and ant-like robots. In: Proceedings of the 1st International Conference on Simulation of Adaptive Behaviour, pp. 356-363. MIT Press, Cambridge, MA
[10] Handl, J., Knowles, J. & Dorigo, M. (2003). On the performance of ant-based clustering. In: Proceedings of the Third International Conference on Hybrid Intelligent Systems, pp. 204-213. IOS Press
[11] Handl, J. & Meyer, B. (2002). Improved ant-based clustering and sorting in a document retrieval interface. In: Proceedings of the Seventh International Conference on Parallel Problem Solving from Nature (PPSN VII), pp. 913-923. Springer-Verlag, Berlin
[12] Hartigan, J. & Wong, M. (1979). Algorithm AS136: a k-means clustering algorithm. Applied Statistics, 28: 100-108
[13] Hoe, K., Lai, W. & Tai, T. (2002). Homogeneous ants for Web document similarity modeling and categorization. In: Proceedings of the Seventh International Conference on Parallel Problem Solving from Nature, pp. 256-261
[14] Hotho, A., Staab, S. & Stumme, G. (2003). WordNet improves text document clustering. In: Proceedings of the Semantic Web Workshop at SIGIR-2003, 26th Annual International ACM SIGIR Conference, Toronto, Canada, July 28-August 1, 2003
[15] Jain, A.K., Murty, M.N. & Flynn, P.J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3): 264-323
[16] Jing, L., Zhou, L., Ng, M.K. & Huang, J.Z. (2006). Ontology-based distance measure for text clustering. In: Proceedings of the 4th Workshop on Text Mining, 6th SIAM International Conference on Data Mining
[17] Kanade, P. & Hall, L.O. (2003). Fuzzy ants as a clustering concept. In: Proceedings of the 22nd International Conference of the North American Fuzzy Information Processing Society (NAFIPS), pp. 227-232
[18] Kuntz, P., Snyers, D. & Layzell, P. (1998). A stochastic heuristic for visualising graph clusters in a bi-dimensional space prior to partitioning. Journal of Heuristics, 5: 327-351
[19] Labroché, N., Monmarché, N. & Venturini, G. (2002). A new clustering algorithm based on the chemical recognition system of ants. In: Proceedings of the 2002 European Conference on Artificial Intelligence, pp. 345-349
[20] Lewis, D. (2006). Reuters-21578 text categorization test collection. Available via: http://www.daviddlewis.com/resources/testcollections/reuters21578. Cited Nov. 10, 2006
[21] Li, Y., Bandar, Z. & McLean, D. (2003). An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering, 13: 871-882
[22] Lin, D. (1998). An information-theoretic definition of similarity. In: Proceedings of


the Fifteenth International Conference on Machine Learning, pp. 296-304
[23] Lindberg, D.A., Siegel, E.R., Rapp, B.A., Wallingford, K.T. & Wilson, S.R. (1993). Use of MEDLINE by physicians for clinical problem solving. Journal of the American Medical Association, 269: 3124-3129
[24] Lumer, E. & Faieta, B. (1994). Diversity and adaptation in populations of clustering ants. In: Proceedings of the Third International Conference on Simulation of Adaptive Behaviour. MIT Press, Cambridge, MA
[25] Megaputer Intelligence Inc. (2006). Online introduction to TextAnalyst™. Available via: http://www.megaputer.com/products/. Cited Nov. 12, 2006
[26] Miller, G. (1995). WordNet: a lexical database for English. Communications of the ACM, 38: 39-41
[27] Monmarché, N. (1999). On data clustering with artificial ants. In: AAAI-99 and GECCO-99 Workshop on Data Mining with Evolutionary Algorithms: Research Directions, pp. 23-26
[28] Rada, R., Mili, H., Bicknell, E. & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19: 17-30
[29] Ramos, V. & Abraham, A. (2005). ANTIDS: self-organized ant-based clustering model for intrusion detection system. In: Proceedings of the Fourth IEEE International Workshop on Soft Computing as Transdisciplinary Science and Technology (WSTST'05), pp. 977-986. Springer-Verlag, Berlin
[30] Ramos, V. & Merelo, J. (2002). Self-organized stigmergic document maps: environments as a mechanism for context learning. In: Proceedings of the First Spanish Conference on Evolutionary and Bio-Inspired Algorithms, pp. 284-293
[31] Rector, A., Gangemi, A., Galeazzi, E., Glowinski, A. & Rossi-Mori, A. (1994). The GALEN CORE model schemata for anatomy: towards a reusable application-independent model of medical concepts. In: Proceedings of Medical Informatics Europe (MIE94)
[32] Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence
[33] Salton, G., Fox, E. & Wu, H. (1983). Extended Boolean information retrieval. Communications of the ACM, 26: 1022-1036
[34] Salton, G., Wong, A. & Yang, C.S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11): 613-620
[35] Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34: 1-47
[36] Vizine, A.L., de Castro, L.N., Hruschka, E.R. & Gudwin, R.R. (2005). Towards improving clustering ants: an adaptive ant clustering algorithm. Informatica, 29: 143-154
[37] Ward, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal


of the American Statistical Association, 58: 236-244
[38] Xia, H., Wang, S. & Yoshida, T. (2006). Toward a revised ant-based text clustering algorithm. In: Proceedings of the 7th International Symposium on Knowledge and Systems Sciences, pp. 159-166. Global-Link Publisher, Hong Kong

Haoxiang Xia is an associate professor at Dalian University of Technology. He obtained his Ph.D. degree from the Institute of Systems Engineering, Dalian University of Technology (DUT) in 1998. Before working at DUT since 2000, he was a postdoctoral fellow at the Institute of Systems Science, the Chinese Academy of Sciences, from 1998 to 2000. He worked at the Japan Advanced Institute of Science and Technology as a visiting associate professor from 2004 to 2006. His major research interests include Internet-based information systems, knowledge management systems, and complex adaptive systems.

Shuguang Wang is a software engineer at BHR-Frontline Technologies (Dalian) Co., Ltd. He received his master's degree from the Institute of Systems Engineering, Dalian University of Technology in 2006. His research interests are in data clustering, text mining and evolutionary algorithms.

Taketoshi Yoshida is a professor at the Japan Advanced Institute of Science and Technology. He received his Ph.D. degree from the Department of Systems Engineering, Case Western Reserve University in 1984. He worked for IBM Japan from 1985 to 1997. His research interests are in systems science and knowledge-handling information systems.