Text Clustering Based on Frequent Items Using Zoning and Ranking
S. Suneetha¹, Dr. M. Usha Rani², Yaswanth Kumar Avulapati³
¹Research Scholar, ²Associate Professor, Department of Computer Science, SPMVV, Tirupati
³Research Scholar, Dept. of Computer Science, S.V. University, Tirupati
¹suneethanaresh@yahoo.com, ²musha_rohan@yahoo.com, ³Yaswanthkumar_1817@yahoo.co.in
Abstract— In today's information age, there is an incredible nonstop growth in the textual information available in electronic form. This growing body of textual data has led to the task of mining useful or interesting frequent itemsets (words/terms) from very large unstructured text databases, a task that remains quite challenging. The use of such frequent associations for text clustering has received a great deal of attention in the research community, since the mined frequent itemsets reduce the dimensionality of the documents drastically. In this work, an effective approach for text clustering is developed based on frequent itemsets, providing significant dimensionality reduction. The Apriori algorithm, a well-known method for mining frequent itemsets, is used. Then, a set of non-overlapping partitions is obtained using these frequent itemsets, and the resultant clusters are generated within each partition of the document collection. An extensive analysis of the frequent item-based text clustering approach is conducted on a real-life text dataset, Reuters-21578. Experimental results for 100 documents of the Reuters-21578 dataset are given, and performance is evaluated with precision, recall, and F-measure. The results show that the proposed approach groups the documents into clusters effectively and, for the dataset used in experimentation, mostly provides better precision.
Keywords— Text Mining, Text Clustering, Text Documents, Frequent Itemsets, Apriori Algorithm, Reuters-21578.
I. INTRODUCTION
The current age is referred to as the "Information Age". In this age, information leads to power and success, but only if one can "get the right information, to the right people, at the right time, on the right medium, in the right language, with the right level of detail". The abundance of data, coupled with the need for powerful data analysis tools, is described as a "data rich but information poor" situation. To relieve this dilemma, a new discipline named data mining emerged, which devotes itself to extracting knowledge from huge volumes of data with the help of ubiquitous modern computing devices. The term "data mining", also known as Knowledge Discovery in Databases (KDD), is formally defined as "the non-trivial extraction of implicit, previously unknown, and potentially useful information from large amounts of data" [13]. Data mining is not specific to one type of media or data; it is applicable to any kind of information repository.

Generally, data mining is performed on data represented in quantitative, textual, or multimedia forms. In recent times, there is an increasing flood of unstructured textual information, and the area of text mining is growing rapidly, mainly because of the strong need to analyze this vast amount of textual data. As written words are the most natural form of storing and exchanging information, text mining has very high commercial potential [9], [11], and it is regarded as the next wave of knowledge discovery. Traditional document and text management tools are inadequate for these needs: document management systems work well with homogeneous documents but not with a heterogeneous mix, and even the best internet search tools suffer from poor precision and recall. The ability to distil this untapped source of information, free text documents, provides substantial competitive advantages in the era of a knowledge-based economy.
Thus, text mining provides a competitive edge for a company to process and take advantage of massive textual information. Text mining, also known as text data mining or knowledge discovery from textual databases, is defined as "the non-trivial extraction of implicit, previously unknown, and potentially useful information from textual data" [3], or as "the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents". 'High quality' in text mining refers to some combination of relevance, novelty, and interestingness [6].

'Text clustering' or 'document clustering' is the organization of a collection of text documents into clusters based on similarity: intuitively, documents within a valid cluster are more similar to each other than to those belonging to a different cluster. In other words, documents in one cluster share similar topics. Thus, the goal of a text clustering scheme is to minimize intra-cluster distances between documents while maximizing inter-cluster distances [12]. It is the most common form of unsupervised learning and an efficient way of sorting documents to help users sift, summarize, and arrange text documents [4], [24], [14].

In this paper, an effective approach for frequent itemset-based text clustering using zoning and ranking is proposed. First, the text documents in the document set are preprocessed. Then, the top-p frequent words are extracted from each document, and a binary mapped database is formed from these extracted words. Next, the Apriori algorithm is applied to discover the frequent itemsets of different
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 6, June 2011, ISSN 1947-5500, http://sites.google.com/site/ijcsis/
lengths. For every length, the mined frequent itemsets are sorted in descending order of support. Subsequently, the documents are split into partitions using the sorted frequent itemsets. Finally, the resultant clusters are formed within each partition using the derived keywords.

II. PROPOSED APPROACH

Text mining is an increasingly important research field because of the necessity of obtaining knowledge from an enormous number of unstructured text documents [23]. Text clustering is one of the fundamental tasks in text mining: it groups a collection of documents into categories so that the documents in the same category describe the same subject. Many researchers [6], [7], [16], [17], [19], [24], [25] have investigated ways to improve the performance of text or document clustering based on popular clustering algorithms (partitional and hierarchical clustering) and on frequent term-based clustering. In the current work, an effective approach for clustering a text corpus with the help of frequent itemsets is proposed.
A. Algorithm: Text Clustering Process
The effective approach for clustering a text corpus with the help of frequent itemsets is furnished below:
1. Collect the set of documents D = {d1, d2, d3, ..., dn} to be clustered.
2. Apply the text preprocessing method on D.
3. Create the binary database B.
4. Mine the frequent itemsets from B using the Apriori algorithm.
5. Organize the output of the first stage of Apriori into sets of frequent itemsets of different lengths.
6. Partition the text documents based on the frequent itemsets.
7. Cluster the text documents within each zone based on their rank.
8. Output the resultant clusters.

The devised approach consists of the following major steps:
(1) Text preprocessing
(2) Mining of frequent itemsets
(3) Partitioning the text documents based on frequent itemsets
(4) Clustering of text documents within the partition

The steps of the algorithm are explained in detail below.
B. Text Preprocessing
Let D = {d1, d2, d3, ..., dn} be the set of text documents, where n is the number of documents in the text dataset D. The text document set D is converted from an unstructured format into a common representation using text preprocessing techniques: words/terms are extracted (tokenization), and the input dataset D is preprocessed using stop word removal and a stemming algorithm.
1) Stop Word Removal: This is the process of removing non-information-bearing words from the documents to reduce noise and save a large amount of space, thus making the later processing more effective and efficient. Stop words are dependent on the natural language [20].

Stop words used for Reuters-21578: a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, that, the, these, this, those, who, whom, what, where, which, why, of, is, are, when, will, was, were, be, as, their, been, have, has, had, from, may, might, there, should, it, its, it's, find, out, with, native, status, all, live, in, me, get, who's, at, or, then, want, and, an, you, our, on, for, can, to, used, they, so, if, into, by, more, about, said, talk, my, mine, your, yours, we, us, ours, he, she, her, him, his, them.
2) Stemming Algorithm: A stemming algorithm is a computational procedure that reduces all words with the same root to a common form by stripping each word of its derivational and inflectional suffixes. The approach to stemming employed here involves a two-phase stemming system. The first phase, the stemming algorithm 'proper', retrieves the stem of a word by removing its longest possible ending that matches one on a list stored in the computer. The second phase handles "spelling exceptions" [18].
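A minimal sketch of this preprocessing stage, assuming an illustrative stop word subset and suffix list rather than the paper's exact ones:

```python
# Illustrative preprocessing: tokenization, stop word removal, and a
# crude longest-suffix stemmer. Stop words and suffixes here are a
# small stand-in sample, not the full lists used in the paper.
import re

STOP_WORDS = {"the", "this", "that", "is", "are", "of", "and", "a", "an",
              "in", "to", "was", "were", "for", "with", "by", "from"}

def stem(word):
    """Strip the longest matching suffix from a small illustrative list."""
    for suffix in ("ational", "ization", "ing", "ions", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The markets were trading in rising grain prices"))
```

A production system would normally substitute a full stop word list and a standard stemmer (e.g. Porter's) for these stand-ins.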
C. Mining of Frequent Itemsets
This section describes the mining of frequent itemsets from the preprocessed text document set D. For every document d_i, the frequency of the words/terms extracted in the preprocessing step is computed, and the top-p frequent words of each document d_i are taken:

    W(d_i) = { w_j(d_i) | 1 <= j <= p },  d_i ∈ D

From the sets of top-p frequent words, the binary database B is formed by collecting the unique words. Let B be a binary database consisting of n transactions (the number of documents) and q attributes (the unique words) [u_1, u_2, ..., u_q]. B consists of binary values indicating whether each unique word is present in document d_i or not:

    B_ij = 1 if u_j ∈ d_i, 0 if u_j ∉ d_i;  1 <= j <= q, 1 <= i <= n

Then, the binary database B is fed to the Apriori algorithm for mining the frequent itemsets (words/terms).
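The construction of B from the per-document top-p word lists might look like the following sketch (the toy documents are illustrative only):

```python
# Sketch of building the binary database B: take the top-p most frequent
# terms of each preprocessed document, pool the unique words across all
# documents, and mark presence (1) or absence (0) per document.
from collections import Counter

def top_p_words(tokens, p):
    """Return the p most frequent terms of one preprocessed document."""
    return [w for w, _ in Counter(tokens).most_common(p)]

def build_binary_db(docs_tokens, p):
    """B[i][j] = 1 iff unique word u_j appears in the top-p list of d_i."""
    top_lists = [top_p_words(toks, p) for toks in docs_tokens]
    vocab = sorted(set(w for lst in top_lists for w in lst))
    B = [[1 if u in lst else 0 for u in vocab] for lst in top_lists]
    return B, vocab

docs = [["trade", "trade", "grain", "price"], ["grain", "grain", "export"]]
B, vocab = build_binary_db(docs, p=2)
print(vocab)  # unique top-p words across documents
print(B)      # one 0/1 row per document
```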
1) Apriori Algorithm: Apriori is a classical algorithm for mining association rules, first introduced in [2]. There are two steps in mining association rules: (1) finding frequent (large) itemsets, and (2) generating association rules from the frequent itemsets. Frequent itemsets are themselves generated in two steps: first, candidate itemsets are generated, and second, the frequent itemsets are selected from these candidates. The itemsets whose support is greater than the minimum support given by the user are referred to as 'frequent itemsets'. In the proposed approach, only the frequent itemsets are used for further processing, so only the first step (generation of frequent itemsets) of the Apriori algorithm is performed. The pseudocode for the Apriori algorithm [1] is:

    I_1 = {large 1-itemsets};
    for (k = 2; I_{k-1} != {}; k++) begin
        C_k = apriori-gen(I_{k-1});        // new candidates
        forall transactions T in D begin
            C_T = subset(C_k, T);          // candidates contained in T
            forall candidates c in C_T do
                c.count++;
        end
        I_k = {c in C_k | c.count >= minsup};
    end
    Answer = union over k of I_k;
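A compact, self-contained sketch of this first Apriori phase (frequent itemset generation only, as used in the proposed approach) over a binary document-word database; the candidate generation here is a simple join-and-prune, not an optimized implementation:

```python
# Sketch of Apriori's first phase over a binary database B: generate
# candidate k-itemsets from frequent (k-1)-itemsets, prune candidates
# with an infrequent subset, and keep those meeting min_support.
from itertools import combinations

def apriori_frequent_itemsets(B, vocab, min_support):
    """Return {frozenset(words): support} for all frequent itemsets."""
    n = len(B)
    transactions = [frozenset(vocab[j] for j, bit in enumerate(row) if bit)
                    for row in B]
    # I_1: large 1-itemsets
    current = {}
    for w in vocab:
        sup = sum(w in t for t in transactions) / n
        if sup >= min_support:
            current[frozenset([w])] = sup
    frequent = dict(current)
    k = 2
    while current:
        items = sorted(set().union(*current))
        # keep only candidates whose every (k-1)-subset is frequent
        candidates = {frozenset(c) for c in combinations(items, k)
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            sup = sum(c <= t for t in transactions) / n
            if sup >= min_support:
                current[c] = sup
        frequent.update(current)
        k += 1
    return frequent
```

For example, with three documents over the words a, b, c and a minimum support of 0.6, `apriori_frequent_itemsets([[1,1,0],[1,1,1],[1,0,1]], ["a","b","c"], 0.6)` keeps {a}, {b}, {c}, {a,b}, and {a,c}, while {b,c} (support 1/3) is discarded.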
D. Partitioning the Text Documents Based on Frequent Itemsets
This section describes the partitioning of the text documents D based on the mined frequent itemsets. A 'frequent itemset' is a set of words that occur together in some minimum fraction of documents in a cluster. The Apriori algorithm generates sets of frequent itemsets of varying length l, from 1 up to the maximum length s. First, the set of frequent itemsets of each length l is sorted in descending order according to support level:

    f = { f_1, f_2, f_3, ..., f_s },  1 <= l <= s
    f_l = { f_l(i) | 1 <= i <= |f_l| }

where sup(f_l(1)) >= sup(f_l(2)) >= ... >= sup(f_l(|f_l|)), and |f_l| denotes the number of frequent itemsets in the set f_l.

From the sorted list f_(l/2), the first frequent itemset f_(l/2)(1) is selected, and thereby an initial partition c_1 containing all the documents having this itemset f_(l/2)(1) is constructed. Then the second element f_(l/2)(2), whose support is less than that of f_(l/2)(1), is taken to form a new partition c_2. This new partition c_2 is formed by identifying all the documents containing the itemset f_(l/2)(2), excluding those already placed in the initial partition c_1. This procedure is repeated until every text document in the input dataset D has been moved into a partition c_(i). Furthermore, if the procedure does not terminate with the sorted list f_(l/2), the subsequent sorted lists (f_((l/2)+1), f_((l/2)+2), etc.) are processed in the same way. This results in a set of partitions c, where each partition c_(i) contains a set of documents D_c(i)(x):

    c = { c_(i) | c_(i) <- f_l(i) },  1 <= i <= m, 1 <= l <= s
    c_(i) = Doc[ f_l(i) ]
    c_(i) = { D_c(i)(x) | D_c(i)(x) ∈ D, 1 <= x <= |c_(i)| }

where m denotes the number of partitions and |c_(i)| denotes the number of documents in each partition.

For constructing the initial partitions (or clusters), the mined frequent itemsets, which significantly reduce the dimensionality of the text document set, are used, so clustering with the reduced dimensionality is considerably more efficient and scalable. Some researchers [15], [22] generated overlapping clusters in accordance with the frequent itemsets and then removed the overlapping documents. In the proposed research, the non-overlapping partitions are generated directly from the frequent itemsets. This makes the initial partitions disjoint, because the proposed approach keeps each document only within its best initial partition.
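The partitioning procedure can be sketched as follows; this is a simplified reading in which documents are sets of words, the itemsets are assumed to be pre-sorted by descending support, and documents matching no itemset fall into a residual partition:

```python
# Sketch of the non-overlapping partitioning step: each document joins
# the partition of the first (highest-support) frequent itemset it
# contains; already-placed documents are skipped, so partitions are
# disjoint by construction.
def partition_documents(doc_words, sorted_itemsets):
    """doc_words: list of word sets; sorted_itemsets: frozensets, support-desc."""
    partitions, placed = [], set()
    for itemset in sorted_itemsets:
        members = [i for i, words in enumerate(doc_words)
                   if i not in placed and itemset <= words]
        if members:
            partitions.append((itemset, members))
            placed.update(members)
    # documents matching no itemset form a residual partition
    leftover = [i for i in range(len(doc_words)) if i not in placed]
    if leftover:
        partitions.append((frozenset(), leftover))
    return partitions
```

Because each document index is added to `placed` the first time it matches, a document can never appear in two partitions, mirroring the disjointness argument above.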
E. Clustering the Text Documents within the Partition
In this section, the process of clustering the set of partitions obtained from the previous step is described. This step forms sub-clusters (describing sub-topics) of a partition (describing the same topic), and the resulting clusters can detect outlier documents effectively. An 'outlier document' in a partition is defined as a document that is different from the remaining documents in the partition. Furthermore, the proposed approach does not require a pre-specified number of clusters. The devised procedure for clustering the text documents in the set of partitions c is discussed below.

In this phase, first the documents D_c(i)(x) and the familiar words f_c(i) (the frequent itemset used for constructing the partition) of each partition c_(i) are identified. Then, the derived keywords [D_c(i)(x)] of each document D_c(i)(x) are obtained by taking the absolute complement of the familiar words f_c(i) with respect to the top-p frequent words of the document D_c(i)(x):

    [D_c(i)(x)] = { w_j \ f_c(i) };  D_c(i)(x) ∈ D, 1 <= i <= m, 1 <= j <= p
    w_j \ f_c(i) = { w_j | w_j ∉ f_c(i) }

The set of unique derived keywords of each partition c_(i) is then obtained, and the support of each unique derived keyword is computed within the partition. The set of keywords
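The derived-keyword computation is a per-document set difference followed by an in-partition support count; a small sketch, with the example words purely illustrative:

```python
# Sketch of the derived-keyword step: a document's derived keywords are
# its top-p frequent words minus the partition's familiar words (the
# frequent itemset that formed the partition). Each unique derived
# keyword's support is then the fraction of partition documents
# containing it.
def derived_keywords(partition_docs_top_p, familiar_words):
    """Return per-document derived keywords and their in-partition support."""
    derived = [set(top_p) - set(familiar_words)
               for top_p in partition_docs_top_p]
    unique = set().union(*derived) if derived else set()
    support = {w: sum(w in d for d in derived) / len(derived)
               for w in unique}
    return derived, support

docs_top_p = [["oil", "price", "trade"], ["oil", "export", "trade"]]
derived, support = derived_keywords(docs_top_p, ["trade"])
print(derived)   # familiar word "trade" removed from each document
print(support)   # in-partition support of each derived keyword
```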