
Text Clustering Based on Frequent Items Using Zoning and Ranking

S. Suneetha¹, Dr. M. Usha Rani², Yaswanth Kumar Avulapati³

¹Research Scholar, ²Associate Professor, Department of Computer Science, SPMVV, Tirupati
³Research Scholar, Dept. of Computer Science, S.V. University, Tirupati
¹suneethanaresh@yahoo.com, ²musha_rohan@yahoo.com, ³Yaswanthkumar_1817@yahoo.co.in

Abstract— In today’s information age, there is incredible nonstop growth in the textual information available in electronic form. This increasing textual data has led to the task of mining useful or interesting frequent itemsets (words/terms) from very large unstructured text databases, a task that still remains quite challenging. The use of such frequent associations for text clustering has received a great deal of attention in research communities, since the mined frequent itemsets reduce the dimensionality of the documents drastically. In this work, an effective approach for text clustering is developed in accordance with the frequent itemsets, providing significant dimensionality reduction. Here, the Apriori algorithm, a well-known method for mining frequent itemsets, is used. Then, a set of non-overlapping partitions is obtained using these frequent itemsets, and the resultant clusters are generated within the partitions for the document collection. An extensive analysis of the frequent item-based text clustering approach is conducted with a real-life text dataset, Reuters-21578. The experimental results of the approach for 100 documents of the Reuters-21578 dataset are given, and its performance has been evaluated with Precision, Recall and F-measure. The results show that the proposed approach performs effectively: it groups the documents into clusters and, for the dataset taken for experimentation, mostly provides good precision.

Keywords— Text Mining, Text Clustering, Text Documents,
Frequent Itemsets, Apriori Algorithm, Reuters-21578.
I. INTRODUCTION
The current age is referred to as the “Information Age”. In
this information age, information leads to power and success,
only if one can “Get the Right Information, To the Right
People, At the Right Time, On the Right Medium, In the Right
Language, With the Right Level of Detail”. The abundance of data, coupled with the need for powerful data analysis tools, has been described as a “Data Rich but Information Poor” situation. To relieve this dilemma, a new discipline named data mining emerged, which devotes itself to extracting knowledge from huge volumes of data with the help of ubiquitous modern computing devices.
The term “Data Mining” also known as Knowledge Discovery
in Databases (KDD) is formally defined as: “the non-trivial
extraction of implicit, previously unknown, and potentially
useful information from large amounts of data” [13]. Data
mining is not specific to one type of media or data. It is
applicable to any kind of information repository.
Generally, data mining is performed on data represented in
quantitative, textual, or multimedia forms. In recent times,
there is an increasing flood of unstructured textual information.
The area of text mining is growing rapidly mainly because
of the strong need for analyzing this vast amount of textual
data. As the most natural form of storing and exchanging
information is written words, text mining has a very high
commercial potential [9], [11], and it is regarded as the next wave of knowledge discovery. Traditional document and text management tools are inadequate for these needs. Document management systems work well with homogeneous documents but not with a heterogeneous mix. Even the best internet search tools suffer from poor precision and recall. The ability to distill this untapped source of information, free-text documents, provides substantial competitive advantages for succeeding in the era of a knowledge-based economy. Thus, Text
Mining provides a competitive edge for a company to process
and take advantage of massive textual information.
Text Mining, also known as Text Data Mining or
Knowledge Discovery from Textual Databases, is defined as,
“the nontrivial extraction of implicit, previously unknown,
and potentially useful information from textual data” [3], or
“the process of extracting interesting and non-trivial patterns
or knowledge from unstructured text documents”. 'High
Quality' in text mining refers to some combination of
relevance, novelty, and interestingness [6].
‘Text Clustering’ or ‘Document Clustering’ is ‘the
organization of a collection of text documents into clusters
based on similarity. Intuitively, documents within a valid
cluster are more similar to each other than those belonging to
a different cluster’. In other words, documents in one cluster
share similar topics. Thus, the goal of text clustering scheme
is to minimize intra-cluster distances between documents,
while maximizing inter-cluster distances [12]. It is the most common form of unsupervised learning and an efficient way of sorting many documents, assisting users to sift, summarize, and arrange text documents [4], [24], [14].
In this paper, an effective approach for frequent itemset-
based text clustering using zoning and ranking is proposed.
First, the text documents in the document set are preprocessed.
Then, top-p frequent words are extracted from each document
and the binary-mapped database is formed through the use of these extracted words. Then, the Apriori algorithm is applied to discover the frequent itemsets of varying
length. For every length, the mined frequent itemsets are
sorted in descending order based on their support level.
Subsequently, the documents are split into partitions using the sorted frequent itemsets. Furthermore, the resultant clusters are formed within each partition using the derived keywords.
II. PROPOSED APPROACH
Text mining is an increasingly important research field
because of the necessity of obtaining knowledge from an enormous number of unstructured text documents [23]. Text
clustering is one of the fundamental functions in text mining.
Its aim is to group a collection of documents into different category
groups so that documents in the same category group describe
the same subject. Many researchers [6], [7], [16], [17], [19],
[24], [25] investigated possible ways to improve the
performance of text or document clustering based on the
popular clustering algorithms (partitional and hierarchical
clustering) and frequent term based clustering. In the current
work, an effective approach for clustering a text corpus with
the help of frequent itemsets is proposed.
A. Algorithm: Text Clustering Process.
The effective approach for clustering a text corpus with the
help of frequent itemsets is furnished below:
1. Collect the set of documents D = {d1, d2, d3, ..., dn} to be clustered.
2. Apply the text preprocessing methods on D.
3. Create the binary database B.
4. Mine the frequent itemsets using the Apriori algorithm on B.
5. Organize the output of the first stage of Apriori into sets of frequent itemsets of different lengths.
6. Partition the text documents based on the frequent itemsets.
7. Cluster the text documents within each zone based on their rank.
8. Output the resultant clusters.

The devised approach consists of the following major steps:
(1) Text PreProcessing
(2) Mining of Frequent Itemsets
(3) Partitioning the text documents based on frequent
itemsets
(4) Clustering of text documents within the partition
The steps of the algorithm are explained in detail below:
B. Text PreProcessing.
Let $D = \{d_1, d_2, d_3, \ldots, d_n\}$ be the set of text documents, where $n$ is the number of documents in the text dataset $D$. The text document set $D$ is converted from an unstructured format into a common representation using text preprocessing techniques, in which words/terms are extracted (tokenization) and the input dataset $D$ is preprocessed using two techniques, namely stop word removal and a stemming algorithm.
1) Stop Word Removal: This is the process of removing non-information-bearing words from the documents to reduce noise as well as to save a huge amount of space, and thus to make the later processing more effective and efficient. Stop words are dependent on the natural language [20].
Stop Words for Reuters-21578: a, b, c, d, e, f, g, h, i, j, k, l,
m, n, o, p, q, r, s, t, u, v, w, x, y, z, that, the, these, this, those,
who, whom, what, where, which, why, of, is, are, when, will,
was, were, be, as, their, been, have, has, had, from, may,
might, there, should, their, it, its, it's, find, out, with, the,
native, status, all, live, in, who, me, get, who, who’s, whom,
the, this, there, is, at, was, or, are, then, that, when, why, what,
want, have, had, has, and, an, you, our, on, of, with, for, can,
to, be, used, all, they, from, so, as, in, if, where, into, by, were,
more, about, said, talk, my, mine, me, you, your, yours, we, us,
our, ours, he, she, it, her, him, his, they, them, their, there.
2) Stemming Algorithm: A stemming algorithm is a
computational procedure that reduces all words with the same
root to a common form, by stripping each word of its
derivational and inflectional suffixes.
The approach to stemming employed here involves a two-phase stemming system. The first phase, the stemming algorithm ‘proper’, retrieves the stem of a word by removing its longest possible ending which matches one on a list stored in the computer. The second phase handles “spelling exceptions” [18].
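To make this stage concrete, a minimal Java sketch of the preprocessing pipeline is given below (Java is the implementation language reported in Section III, though this sketch targets a current JDK). The abbreviated stop-word list and the crude longest-suffix stripper are illustrative stand-ins, not the full Reuters-21578 list above or the actual two-phase Lovins-style stemmer [18].

import java.util.*;

public class Preprocessor {
    // Abbreviated stop-word list; the full Reuters-21578 list is given above.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
            "had", "has", "have", "in", "is", "it", "of", "on", "that", "the",
            "their", "this", "to", "was", "were", "when", "which", "who", "with"));

    // Crude stand-in for the two-phase stemmer: strip the longest matching
    // suffix from a stored list (the spelling-exception phase is omitted).
    private static final String[] SUFFIXES = {"ations", "ation", "ingly",
            "ings", "ing", "edly", "ed", "es", "s", "ly"};

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {           // longest suffixes first
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Tokenize, remove stop words, and stem each remaining token.
    public static List<String> preprocess(String document) {
        List<String> terms = new ArrayList<>();
        for (String token : document.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(stem(token));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The shipments of gold were increasing rapidly"));
        // -> [shipment, gold, increas, rapid]
    }
}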
C. Mining of Frequent ItemSets.
This section describes the mining of frequent itemsets from the preprocessed text document set $D$. For every document $d_i$, the frequency of the words/terms extracted in the preprocessing step is computed, and the top-$p$ frequent words of each document $d_i$ are taken:

$$K_w = \{\, p(d_i) \mid d_i \in D \,\}, \quad \text{where } p(d_i) = T_{w_j},\ 1 \le j \le p$$

From the set of top-$p$ frequent words, the binary database $B^T$ is formed by obtaining the unique words. Let $B^T$ be a binary database consisting of $n$ transactions $T$ (the number of documents) and $q$ attributes (the unique words) $U = [u_1, u_2, \ldots, u_q]$. The binary database $B^T$ consists of binary data that represents whether or not each unique word is present in document $d_i$:

$$B^T_{ij} = \begin{cases} 0 & \text{if } u_j \notin d_i \\ 1 & \text{if } u_j \in d_i \end{cases}; \quad 1 \le j \le q,\ 1 \le i \le n$$

Then, the binary database $B^T$ is fed to the Apriori algorithm for mining the frequent itemsets (words/terms) $F_s$.
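A small sketch of this step under the same illustrative Java setup: collect the top-$p$ frequent terms of each preprocessed document, take their union as the attribute set $U$, and fill the binary matrix $B^T$. The class and method names are hypothetical.

import java.util.*;

public class BinaryDatabase {
    // Top-p most frequent terms of one preprocessed document.
    public static List<String> topP(List<String> terms, int p) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : terms) freq.merge(t, 1, Integer::sum);
        List<String> words = new ArrayList<>(freq.keySet());
        words.sort((a, b) -> freq.get(b) - freq.get(a));   // most frequent first
        return words.subList(0, Math.min(p, words.size()));
    }

    // The attribute set U: union of all documents' top-p word lists.
    public static List<String> uniqueWords(List<List<String>> topWordsPerDoc) {
        Set<String> u = new LinkedHashSet<>();
        for (List<String> doc : topWordsPerDoc) u.addAll(doc);
        return new ArrayList<>(u);
    }

    // The n x q binary matrix: b[i][j] = 1 iff unique word j occurs among
    // the top-p words of document i, and 0 otherwise.
    public static int[][] build(List<List<String>> topWordsPerDoc, List<String> uniqueWords) {
        int[][] b = new int[topWordsPerDoc.size()][uniqueWords.size()];
        for (int i = 0; i < topWordsPerDoc.size(); i++)
            for (int j = 0; j < uniqueWords.size(); j++)
                b[i][j] = topWordsPerDoc.get(i).contains(uniqueWords.get(j)) ? 1 : 0;
        return b;
    }

    public static void main(String[] args) {
        List<List<String>> top = List.of(List.of("oil", "price"), List.of("oil", "wheat"));
        System.out.println(Arrays.deepToString(build(top, uniqueWords(top))));
        // -> [[1, 1, 0], [1, 0, 1]]
    }
}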
1) Apriori Algorithm: Apriori is a traditional algorithm for
mining association rules that was first introduced in [2]. There
are two steps used for mining association rules: (1) Finding
frequent or large itemsets (2) Generating association rules
from the frequent itemsets. Frequent itemsets can be generated
in two steps: first, candidate itemsets are generated, and second, frequent itemsets are mined using these candidate itemsets. The itemsets whose support is greater than the minimum support given by the user are referred to as ‘frequent itemsets’. In the proposed approach, only the frequent itemsets are used for further processing, so only the first step (generation of frequent itemsets) of the Apriori algorithm is performed. The pseudo code for the Apriori algorithm [1] is:
I_1 = {large 1-itemsets};
for (k = 2; I_{k-1} != ∅; k++) do begin
    C_k = apriori-gen(I_{k-1});        // New candidates
    forall transactions T ∈ D do begin
        C_T = subset(C_k, T);          // Candidates contained in T
        forall candidates c ∈ C_T do
            c.count++;
    end
    I_k = {c ∈ C_k | c.count ≥ min_sup}
end
Answer = ∪_k I_k;
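Since only this first stage (frequent-itemset generation) is needed here, a compact runnable version might look as follows. This is a plain level-wise sketch that joins frequent k-itemsets pairwise and omits the subset-pruning refinement of apriori-gen; it is an illustration, not the authors' code.

import java.util.*;

public class Apriori {
    // Mine all frequent itemsets from transactions (sets of term indices),
    // keeping every itemset whose support count >= minSup.
    public static Map<Set<Integer>, Integer> mine(List<Set<Integer>> transactions, int minSup) {
        Map<Set<Integer>, Integer> frequent = new HashMap<>();

        // Level 1: frequent single items.
        Map<Set<Integer>, Integer> current = new HashMap<>();
        Map<Integer, Integer> counts = new HashMap<>();
        for (Set<Integer> t : transactions)
            for (int item : t) counts.merge(item, 1, Integer::sum);
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minSup) {
                Set<Integer> single = new TreeSet<>();
                single.add(e.getKey());
                current.put(single, e.getValue());
            }
        }

        while (!current.isEmpty()) {
            frequent.putAll(current);
            // Candidate generation: join pairs of frequent k-itemsets into (k+1)-itemsets.
            List<Set<Integer>> keys = new ArrayList<>(current.keySet());
            Set<Set<Integer>> candidates = new HashSet<>();
            for (int i = 0; i < keys.size(); i++) {
                for (int j = i + 1; j < keys.size(); j++) {
                    Set<Integer> union = new TreeSet<>(keys.get(i));
                    union.addAll(keys.get(j));
                    if (union.size() == keys.get(i).size() + 1) candidates.add(union);
                }
            }
            // Count candidate occurrences and keep the frequent ones.
            Map<Set<Integer>, Integer> next = new HashMap<>();
            for (Set<Integer> c : candidates) {
                int count = 0;
                for (Set<Integer> t : transactions) if (t.containsAll(c)) count++;
                if (count >= minSup) next.put(c, count);
            }
            current = next;
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
                Set.of(1, 2, 3), Set.of(1, 2), Set.of(2, 3), Set.of(1, 2, 3));
        System.out.println(mine(db, 2));
        // e.g. {[2]=4, [1]=3, [3]=3, [1, 2]=3, [2, 3]=3, [1, 3]=2, [1, 2, 3]=2}
    }
}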

D. Partitioning the Text Documents based on Frequent
ItemSets.
This section describes the partitioning of the text documents $D$ based on the mined frequent itemsets $F$. A ‘frequent itemset’ is a set of words that occur together in some minimum fraction of documents in a cluster. The Apriori algorithm generates a set of frequent itemsets with lengths ($l$) varying from 1 to $k$. First, the set of frequent itemsets of each length ($l$) is sorted in descending order of support level:

$$F = \{f_1, f_2, f_3, \ldots, f_k\}; \quad 1 \le l \le k$$

$$f_l = \{\, f_{l(i)};\ 1 \le i \le t \,\}$$

where $\sup(f_{l(1)}) \ge \sup(f_{l(2)}) \ge \cdots \ge \sup(f_{l(t)})$ and $t$ denotes the number of frequent itemsets in the set $f_l$.
From the sorted list $f_{(k/2)}$, the first element $f_{(k/2)}(1)$ is selected, and thereby an initial partition $c_1$, containing all the documents having the itemset $f_{(k/2)}(1)$, is constructed. Then, the second element $f_{(k/2)}(2)$, whose support is less than that of $f_{(k/2)}(1)$, is taken to form a new partition $c_2$. This new partition $c_2$ is formed by identifying all the documents having the large itemset $f_{(k/2)}(2)$, excluding the documents already placed in the initial partition $c_1$. This procedure is repeated until every text document in the input dataset $D$ has been moved into a partition $C_{(i)}$. Furthermore, if the above procedure does not terminate with the sorted list $f_{(k/2)}$, then the subsequent sorted lists ($f_{(k/2)-1}$, $f_{(k/2)-2}$, etc.) are taken for performing the above step. This results in a set of partitions $c$, where each partition $C_{(i)}$ contains a set of documents $D_{c(i)}^{(x)}$:

$$c = \{\, c_{(i)} \mid f_{l(i)} \in c_{(i)} \,\}; \quad 1 \le i \le m,\ 1 \le l \le k$$

$$C_{(i)} = Doc[f_{l(i)}];$$

$$C_{(i)} = \{\, D_{c(i)}^{(x)};\ D_{c(i)}^{(x)} \in D,\ 1 \le x \le r \,\}$$

where $m$ denotes the number of partitions and $r$ denotes the number of documents in each partition.
For constructing the initial partitions (or clusters), the mined frequent itemsets, which significantly reduce the dimensionality of the text document set, are used, and so the clustering with reduced dimensionality is considerably more efficient and scalable. Some researchers [15], [22] generated overlapping clusters in accordance with the frequent itemsets and then removed the overlapping documents. In the proposed research, the non-overlapping partitions are generated directly from the frequent itemsets. This makes the initial partitions disjoint, because the proposed approach keeps each document only within its best initial partition.
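The zoning step can be sketched as follows, continuing the illustrative Java setup: the support-sorted itemsets are scanned in order, and each not-yet-assigned document goes to the first itemset that covers it, which is exactly what keeps the partitions disjoint. The caller is assumed to supply the itemsets sorted by descending support, starting from length k/2 and falling back to shorter lengths; all names here are hypothetical.

import java.util.*;

public class Partitioner {
    // Assign each document to the first (highest-support) itemset that covers it.
    public static Map<Set<Integer>, List<Integer>> partition(
            List<Set<Integer>> docs, List<Set<Integer>> sortedItemsets) {
        Map<Set<Integer>, List<Integer>> partitions = new LinkedHashMap<>();
        Set<Integer> assigned = new HashSet<>();   // document ids already zoned
        for (Set<Integer> itemset : sortedItemsets) {
            List<Integer> zone = new ArrayList<>();
            for (int i = 0; i < docs.size(); i++) {
                // Keep each document only in its best (first matching) partition.
                if (!assigned.contains(i) && docs.get(i).containsAll(itemset)) {
                    zone.add(i);
                    assigned.add(i);
                }
            }
            if (!zone.isEmpty()) partitions.put(itemset, zone);
            if (assigned.size() == docs.size()) break;   // every document zoned
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Set<Integer>> docs = List.of(Set.of(1, 2, 3), Set.of(1, 2), Set.of(2, 3));
        List<Set<Integer>> itemsets = List.of(Set.of(1, 2), Set.of(2, 3));
        System.out.println(partition(docs, itemsets));
        // e.g. {[1, 2]=[0, 1], [2, 3]=[2]} (set print order may vary)
    }
}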
E. Clustering the Text Documents within the Partition.
In this section, the process of clustering the set of partitions
obtained from the previous step is described. This step is
necessary to form sub-clusters (describing sub-topics) within the partition (describing the same topic), and the resulting clusters can detect outlier documents effectively. An ‘outlier document’ in a partition is defined as a document that is different from the remaining documents in the partition. Furthermore, the proposed approach does not require a pre-specified number of clusters. The devised procedure for clustering the text documents available in the set of partitions $c$ is discussed below:
In this phase, first the documents $D_{c(i)}^{(x)}$ and the familiar words $f_{c(i)}$ (the frequent itemset used for constructing the partition) of each partition $C_{(i)}$ are identified. Then, the derived keywords $K_d[D_{c(i)}^{(x)}]$ of document $D_{c(i)}^{(x)}$ are obtained by taking the absolute complement of the familiar words $f_{c(i)}$ with respect to the top-$p$ frequent words of the document $D_{c(i)}^{(x)}$:

$$K_d[D_{c(i)}^{(x)}] = \{\, T_{w_j} \setminus f_{c(i)} \mid T_{w_j} \in D_{c(i)}^{(x)} \,\}; \quad 1 \le i \le m,\ 1 \le j \le p,\ 1 \le x \le r$$

$$T_{w_j} \setminus f_{c(i)} = \{\, x \in T_{w_j} \mid x \notin f_{c(i)} \,\}$$
The set of unique derived keywords of each partition $C_{(i)}$ is obtained, and the support of each unique derived keyword is computed within the partition. The set of keywords satisfying the cluster support ($cl\_sup$) is formed as the representative words of the partition $C_{(i)}$. The ‘cluster support’ of a keyword in $C_{(i)}$ is the percentage of the documents in $C_{(i)}$ that contain the keyword.

$$R_w[c(i)] = \{\, x : p(x) \,\}$$

where $p(x)$ holds when the support of keyword $x$ in $K_d[D_{c(i)}^{(x)}]$, taken over the partition, is at least $cl\_sup$.
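A short sketch of this representative-word selection, under the same illustrative Java setup: a derived keyword becomes a representative word when the fraction of the partition's documents containing it reaches the cluster support threshold. The threshold value and all names here are assumptions for illustration.

import java.util.*;

public class RepresentativeWords {
    // Keep every derived keyword whose within-partition document frequency
    // is at least clSup (the cluster support threshold).
    public static Set<String> select(List<Set<String>> derivedKeywordsPerDoc, double clSup) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (Set<String> keywords : derivedKeywordsPerDoc)
            for (String w : keywords) docFreq.merge(w, 1, Integer::sum);
        Set<String> representative = new LinkedHashSet<>();
        int n = derivedKeywordsPerDoc.size();
        for (Map.Entry<String, Integer> e : docFreq.entrySet())
            if ((double) e.getValue() / n >= clSup) representative.add(e.getKey());
        return representative;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("oil", "price"), Set.of("oil", "opec"), Set.of("wheat"));
        System.out.println(select(docs, 0.5));   // -> [oil]
    }
}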
Subsequently, the similarity of the documents $D_{c(i)}^{(x)}$ with respect to the representative words $R_w[c(i)]$ is found. The definition of the similarity measure of text documents is critically important for obtaining effective and meaningful clusters. The similarity measure $S_m$ of each document is computed as follows:

$$S\big(K_d[D_{c(i)}^{(x)}],\, R_w[c(i)]\big) = K_d[D_{c(i)}^{(x)}] \cap R_w[c(i)]$$

$$S_m\big(K_d[D_{c(i)}^{(x)}],\, R_w[c(i)]\big) = \frac{\big|S\big(K_d[D_{c(i)}^{(x)}],\, R_w[c(i)]\big)\big|}{\big|R_w[c(i)]\big|}$$

The documents within the partition are sorted according to their similarity measure, and documents form a new cluster separately when the similarity measure exceeds the minimum threshold.
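Continuing the illustration, this ranking step can be sketched as below: each document's derived keywords are compared against the partition's representative words, and the documents on either side of the similarity threshold (0.4 in the experiments of Section III) form separate clusters. The sample terms in main are invented for the example.

import java.util.*;

public class ZoneClusterer {
    // Similarity of one document's derived keywords to the representative words:
    // |derived ∩ representative| / |representative|.
    public static double similarity(Set<String> derivedKeywords, Set<String> representative) {
        if (representative.isEmpty()) return 0.0;
        int overlap = 0;
        for (String w : derivedKeywords)
            if (representative.contains(w)) overlap++;
        return (double) overlap / representative.size();
    }

    // Split a partition into two clusters around a similarity threshold:
    // one for low-scoring documents, one for high-scoring documents.
    public static List<List<String>> split(Map<String, Set<String>> docKeywords,
                                           Set<String> representative, double threshold) {
        List<String> low = new ArrayList<>(), high = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : docKeywords.entrySet()) {
            double s = similarity(e.getValue(), representative);
            (s < threshold ? low : high).add(e.getKey());
        }
        return List.of(low, high);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("d64", Set.of("oil", "price", "opec", "barrel"));
        docs.put("d58", Set.of("wheat", "export"));
        Set<String> rep = Set.of("oil", "price", "opec", "crude");
        System.out.println(split(docs, rep, 0.4));   // -> [[d58], [d64]]
    }
}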
III. EXPERIMENTAL RESULTS AND PERFORMANCE
EVALUATION
The proposed approach is implemented using Java (JDK
1.6). This implementation of the proposed algorithm is applied
on the text dataset that is collected from Reuters-21578 text
database. The dataset consists of 21,578 text documents and is widely used by researchers. Out of these, 100 sample documents are taken to evaluate
the developed algorithm. The performance of the proposed
approach is evaluated on these 100 text documents of Reuters-
21578 [21] using Precision, Recall & F-Measure.
Reuters-21578 Text Database: The documents in the
Reuters-21578 collection appeared on the Reuters newswire in
1987. The documents were assembled and indexed with
categories, by personnel from Reuters Ltd. and Carnegie
Group, Inc. in 1987. In 1990, the documents were made
available by Reuters and CGI for research purposes. Further
formatting and data file production was done in 1991 and
1992 by David D. Lewis and Peter Shoemaker. Steve Finch and David D. Lewis did a cleanup of the collection in 1996. The resulting collection has 21,578 documents, hence the name Reuters-21578.
A. Experimental Results.
For experimental results, 100 documents are taken from
various topics and the top 10 words are extracted from each
document. Then, a binary database is constructed with 452
attributes. The frequent itemsets are mined from the binary
database and these itemsets are sorted based on their support
level. Thus, 31 frequent itemsets of varying length are
obtained. Then, the initial partitions are constructed using these frequent itemsets, as shown in Table I. After that, the representative words of each partition are computed based on both the top 10 words and the familiar words of the partition. The similarity measure is calculated for each document in the partition, as shown in Table II. Within a partition, the documents whose similarity value falls below 0.4 are separated into a cluster of their own. So, finally, 19 clusters are obtained from the 14 partitions, as shown in Table III.
TABLE I
GENERATED PARTITIONS OF TEXT DOCUMENTS
Partition   Text Documents
P1          d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d58, d64, d65, d66, d67, d68, d69, d72, d76, d77
P2          d14, d16, d36, d42, d43, d44, d45, d46, d49, d50, d60, d73, d75, d85, d90, d93, d98, d100
P3          d39, d40, d41, d47, d48, d51, d78, d82, d83, d88
P4          d26, d27, d28, d29, d31, d33, d35, d37, d57, d95, d96
P5          d19, d55, d62
P6          d56, d63, d70, d71, d74, d80, d81, d87, d89
P7          d17, d18, d20, d22, d23, d24, d30, d32
P8          d25, d34, d38
P9          d79
P10         d15, d91, d92, d94, d97
P11         d21
P12         d1
P13         d52, d53, d54, d61
P14         d59, d84, d86
TABLE II
SIMILARITY MEASURE OF TEXT DOCUMENTS
Partition   Text Document (Similarity Measure)
P1          d2 (0.125), d3 (0.25), d4 (0.125), d5 (0.125), d6 (0.125), d7 (0.125), d8 (0.125), d9 (0.25), d10 (0.125), d11 (0.25), d12 (0.125), d13 (0.125), d58 (0.0), d64 (0.5), d65 (0.5), d66 (0.625), d67 (0.375), d68 (0.5), d69 (0.625), d72 (0.0), d76 (0.375), d77 (0.25)
P2          d14 (0.3333), d16 (0.0), d36 (0.1666), d42 (0.3333), d43 (0.5), d44 (0.1666), d45 (0.3333), d46 (0.5), d49 (0.5), d50 (0.6666), d60 (0.0), d73 (0.0), d75 (0.0), d85 (0.0), d90 (0.3333), d93 (0.3333), d98 (0.1666), d100 (0.1666)
P3          d39 (0.3846), d40 (0.5385), d41 (0.5385), d47 (0.5385), d48 (0.3846), d51 (0.3077), d78 (0.2308), d82 (0.1538), d83 (0.0), d88 (0.2308)
P4          d26 (0.3333), d27 (0.6666), d28 (0.4444), d29 (0.4444), d31 (0.5555), d33 (0.6666), d35 (0.5555), d37 (0.7777), d57 (0.1111), d95 (0.2222), d96 (0.0)
P5          d19 (0.36), d55 (0.36), d62 (0.36)
P6          d56 (0.2272), d63 (0.1818), d70 (0.2727), d71 (0.0909), d74 (0.3182), d80 (0.3636), d81 (0.3182), d87 (0.3182), d89 (0.2727)
P7          d17 (0.4737), d18 (0.4737), d20 (0.4211), d22 (0.3684), d23 (0.3684), d24 (0.4211), d30 (0.1579), d32 (0.1579)
P8          d25 (0.375), d34 (0.375), d38 (0.375)
P9          d79 (1.0)
P10         d15 (0.2093), d91 (0.2093), d92 (0.2093), d94 (0.2093), d97 (0.2093)
P11         d21 (1.0)
P12         d1 (1.0)
P13         d52 (0.25), d53 (0.25), d54 (0.25), d61 (0.25)
P14         d59 (0.4211), d84 (0.5385), d86 (0.5385)
TABLE III
RESULTANT CLUSTERS
Partition   Cluster   Text Documents
P1          C1        d64, d65, d66, d68, d69
            C2        d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d58, d67, d72, d76, d77
P2          C3        d14, d16, d36, d42, d44, d45, d60, d73, d75, d85, d90, d93, d98, d100
            C4        d43, d46, d49, d50
P3          C5        d39, d48, d51, d78, d82, d83, d88
            C6        d40, d41, d47
P4          C7        d27, d28, d29, d31, d33, d35, d37
            C8        d26, d57, d95, d96
P5          C9        d19, d55, d62
P6          C10       d56, d63, d70, d71, d74, d80, d81, d87, d89
P7          C11       d17, d18, d20, d24
            C12       d22, d23, d30, d32
P8          C13       d25, d34, d38
P9          C14       d79
P10         C15       d15, d91, d92, d94, d97
P11         C16       d21
P12         C17       d1
P13         C18       d52, d53, d54, d61
P14         C19       d59, d84, d86
B. Performance Evaluation.
Evaluation metrics, namely Precision, Recall and F-measure,
are used for evaluating the performance of the proposed
approach. The definitions of the evaluation metrics are given
below:

$$\text{Precision}(i,j) = C_{ij} / C_j$$

$$\text{Recall}(i,j) = C_{ij} / C_i$$

$$F\text{-measure}(i,j) = \frac{2 \times \text{Recall}(i,j) \times \text{Precision}(i,j)}{\text{Precision}(i,j) + \text{Recall}(i,j)}$$

where $C_{ij}$ is the number of members of topic $i$ in cluster $j$, $C_j$ is the number of members of cluster $j$, and $C_i$ is the number of members of topic $i$.
In order to evaluate the proposed approach on Reuters-
21578 database, 100 documents are taken from 8 different
topics (acq, cocoa, coffee, cpi, crude, earn, money-fx, trade).
The proposed approach uses these documents as input text and finally results in 19 clusters. For each cluster, the Precision, Recall and F-Measure are computed using the above definitions. The obtained results are shown in Table IV.
TABLE IV
CLUSTERING PERFORMANCE OF TEXT DOCUMENTS
Partition   Cluster   Precision   Recall   F-measure
P1          C1        1.0         0.36     0.53
            C2        0.71        0.92     0.8
P2          C3        1.0         0.31     0.47
            C4        0.33        0.45     0.38
P3          C5        1.0         0.23     0.37
            C6        0.57        0.33     0.42
P4          C7        1.0         0.5      0.67
            C8        0.5         0.18     0.26
P5          C9        1.0         0.25     0.4
P6          C10       0.44        0.25     0.32
P7          C11       1.0         0.36     0.53
            C12       0.5         0.18     0.26
P8          C13       1.0         0.21     0.35
P9          C14       1.0         0.08     0.15
P10         C15       0.8         0.36     0.5
P11         C16       1.0         0.09     0.17
P12         C17       1.0         0.08     0.15
P13         C18       1.0         0.33     0.5
P14         C19       0.66        0.17     0.27
IV. CONCLUSION
Text clustering is one of the important techniques of text mining. The use of frequent associations for text clustering has received a great deal of attention in research communities, since the mined frequent itemsets reduce the dimensionality of the documents drastically.
In this paper, an effective approach for text clustering is developed in accordance with the frequent itemsets. In the proposed work, the text documents are first preprocessed, and subsequently the Apriori algorithm is applied to discover the frequent itemsets of different lengths. The documents are then split into partitions using the sorted frequent itemsets. Furthermore, the resultant clusters are formed within each partition using the derived keywords. The real-life dataset Reuters-21578 is used for analysing the frequent itemset-based text clustering approach. In addition, the evaluation metrics Precision, Recall and F-Measure are used for evaluating the performance. The high Precision values indicate the effectiveness of the proposed frequent item-based text clustering approach. Furthermore, the proposed approach does not require a pre-specified number of clusters.
In conclusion, the importance of document clustering will
continue to grow along with the massive volumes of
unstructured data generated. Exploiting an effective and
efficient method in document clustering would be an essential
direction for research in text mining, especially text clustering.
V. FUTURE WORK
Future study on text clustering using frequent items has the
following possible avenues:
- In the proposed approach, the Apriori algorithm is used to find the frequent itemsets in the dataset, and it must scan the database repeatedly when the number of frequent 1-itemsets is high or the frequent patterns are large. A possible direction for future research is to use the FP-Growth algorithm in place of Apriori, since FP-Growth identifies the frequent itemsets without candidate generation and is faster.
- The proposed approach does not take outliers into consideration. As a part of future work, outliers can also be handled.
- The current implementation is inappropriate for maintaining the clusters in a dynamic environment. So, another possible research direction is to develop an incremental clustering approach that makes use of frequent itemsets, in order to avoid complete re-clustering of the entire database each time it changes [5], [8], [10].
REFERENCES
[1] Agrawal R and Srikant R, “Fast algorithms for mining association rules”,
In Proceedings of 20th International Conference on Very Large Data
Bases, Santiago, Chile, pp. 487–499, September 1994.
[2] Agrawal R, Imielinski T and Swami A, “Mining association rules
between sets of items in large databases”, In proceedings of the
international Conference on Management of Data, ACM SIGMOD, pp.
207–216, Washington, DC, May 1993.
[3] Bjornar Larsen and Chinatsu Aone, “Fast and Effective Text Mining
Using Linear-time Document Clustering”, in Proceedings of the fifth
ACM SIGKDD international conference on Knowledge discovery and
data mining, San Diego, California, United States , pp. 16 – 22, 1999.
[4] Congnan Luo, Yanjun Li and Soon M. Chung, "Text document
clustering based on neighbors", Data & Knowledge Engineering, Vol:
68, No: 11, pp: 1271-1288, November 2009.
[5] Domingos P and Hulten G. Mining High-Speed Data Streams. In
Knowledge Discovery and Data Mining, pages 71–80, 2000.
[6] Feldman R., Sanger J., “The Text Mining Handbook”, Cambridge
University Press, 2007.
[7] Florian Beil, Martin Ester, Xiaowei Xu “Frequent Term-based Text
Clustering”, In KDD '02: Proceedings of the eighth ACM SIGKDD
International conference on Knowledge discovery and data mining
(2002), pp. 436-442. doi:10.1145/775047.775110.
[8] Guha S, Mishra N, Motwani R, and Callaghan L. O, Clustering Data
Streams. In IEEE Symposium on Foundations of Computer Science,
pages 359–366, 2000.
[9] Haralampos Karanikas, Christos Tjortjis, Babis Theodoulidis, “An
Approach to Text Mining using Information Extraction”, Proc.
Knowledge Management Theory Applications Workshop, (KMTA
2000), Lyon, France, pp. 165-178, Sep 2000.
[10] Hulten G, Spencer L, and Domingos P, Mining time-changing data
streams. In Proceedings of the Seventh ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 97–106,
San Francisco, CA, 2001. ACM Press.
[11] Ah-Hwee Tan, “Text Mining: The state of the art and the challenges”,
In Proceedings of the PAKDD'99 Workshop on Knowledge Discovery from
Advanced Databases (KDAD'99), Beijing, pp. 71-76, April 1999.
[12] Jain A K and Dubes R C, “Algorithms for Clustering Data”, Prentice
Hall, Englewood Cliffs, 1988.
[13] Jiawei Han, Micheline Kamber, “Data Mining: Concepts and
Techniques”, 2006 (c) Morgan Kaufmann Publishers.
[14] Jain, A.K., Murty, M.N., Flynn, P.J., “Data Clustering: A Review”,
ACM Computing Surveys, Vol: 31, No: 3, pp: 264-323. 1999.
[15] Law M.H.C., Figueiredo M.A.T., Jain A.K., “Simultaneous feature
selection and clustering using mixture models”, IEEE Transaction on
Pattern Analysis and Machine Intelligence, 26(9), pp. 1154-1166, 2004.
[16] W.-L. Liu and X.-S. Zheng, "Documents Clustering based on Frequent
Term Sets", Intelligent Systems and Control, 2005.
[17] Le Wang, Li Tian, Yan Jia, Weihong Han, “A Hybrid Algorithm For
Web Document Clustering Based On Frequent Term Sets And K-
Means”, Advances in Web and Network Technologies and Information
Management, Lecture Notes in Computer Science, 2007, Vol.
4537/2007, pp. 198-203, DOI: 10.1007/978-3-540-72909-9_20.
[18] Lovins, J.B. 1968: "Development of a stemming algorithm", Mechanical
Translation and Computational Linguistics, vol. 11, pp. 22-31, 1968.
[19] Murali Krishna S., Durga Bhavani S., “An Efficient Approach For Text
Clustering Based On Frequent Itemsets”, ©Euro Journals Publishing,
Inc. 2010, European Journal of Scientific Research ISSN 1450-216X,
Vol.42, n 3, pp. 399-410, 2010.
[20] Pant. G., Srinivasan. P and Menczer, F., "Crawling the Web". Web
Dynamics: Adapting to Change in Content, Size, Topology and Use,
edited by M. Levene and A. Poulovassilis, Springer-Verlag, pp: 153-
178, November 2004.
[21] Reuters-21578, Text Categorization Collection, UCI KDD Archive.
[22] Shenzhi Li, Tianhao Wu, William M. Pottenger, “Distributed Higher
Order Association Rule Mining Using Information Extracted from
Textual Data”, ACM SIGKDD Explorations Newsletter, Natural
language processing and text mining, Vol. 7, n 1, pp. 26-35, 2005.
[23] Un Yong Nahm, Raymond J Mooney, “Text mining with information
extraction”,CM, pp. 218, 2004.
[24] Xiangwei Liu, Pilian, “A Study On Text Clustering Algorithms Based
On Frequent Term Sets”, Advanced Data Mining and Applications,
Lecture Notes in Computer Science, 2005, Vol. 3584/2005, pp. 347-354,
DOI: 10.1007/11527503_42.
[25] Zhou Chong, Lu Yansheng, Zou Lei, Hu Rong, “FICW: Frequent
Itemset Based Text Clustering with Window Constraint”, Vol. 11, No. 5,
pp. 1345-1351, 2006, DOI: 10.1007/BF02829264.





About the Authors

S. Suneetha pursued her
Bachelor’s Degree in
Science and in Education,
Master’s Degree in
Computer Applications
(MCA) from Sri
Venkateswara University,
Tirupati, Andhra Pradesh,
India. She completed her
M.Phil. in Computer
Science from Sri Padmavati
Mahila Visvavidyalayam,
Tirupati. She has presented and published papers in international and national conferences. Her main research interests include Data Mining and Software Engineering. She is a life member of ISTE. She served Narayana Engineering College, Nellore, Andhra Pradesh, as Sr. Asst. Professor, heading the departments of IT and MCA.


Dr. M. Usha Rani is an
Associate Professor in the
Department of Computer
Science and HOD for CSE&IT,
Sri Padmavathi Mahila Viswavidyalayam (SPMVV, Womens’ University), Tirupati. She did her Ph.D. in Computer Science in the area of Artificial Intelligence and Expert Systems. She has been teaching since 1992. She has presented more than 34 papers at national and international conferences and published 19 articles in national and international journals. She has also written 4 books: Data Mining - Applications: Opportunities and Challenges, Superficial Overview of Data Mining Tools, Data Warehousing & Data Mining, and Intelligent Systems & Communications. She is guiding M.Phil. and Ph.D. scholars in areas such as Artificial Intelligence, Data Warehousing and Data Mining, Computer Networks, and Network Security.










Mr. Yaswanth Kumar Avulapati received his MCA degree with First Class from Sri Venkateswara University, Tirupati. He received his M.Tech. degree in Computer Science and Engineering with Distinction from Acharya Nagarjuna University, Guntur. He is a research scholar at S.V. University, Tirupati, Andhra Pradesh. He has presented a number of papers in national and international conferences and seminars, and has attended a number of workshops in different fields.

