
**Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections**

Douglass R. Cutting¹, David R. Karger¹,², Jan O. Pedersen¹, John W. Tukey¹,³

¹Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304
²Stanford University
³Princeton University

Abstract

Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval.

We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.

1 Introduction

Document clustering has been extensively investigated as a methodology for improving document search and retrieval (see [15] for an excellent review). The general assumption is that mutually similar documents will tend to be relevant to the same queries, and, hence, that automatic determination of groups of such documents can improve recall by effectively broadening a search request (see [11] for a discussion of the cluster hypothesis). Typically a fixed corpus of documents is clustered either into an exhaustive partition, disjoint or otherwise, or into a hierarchical tree structure (see, for example, [8, 13, 2]). In the case of a partition, queries are matched against clusters and the contents of the best scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result. Hybrid strategies are also available.

These strategies are essentially variations of near-neighbor search¹ where nearness is defined in terms of the pairwise document similarity measure used to generate the clustering. Indeed, cluster search techniques are typically compared to direct near-neighbor search [9], and are evaluated in terms of precision and recall. Various studies indicate that cluster search strategies are not markedly superior to near-neighbor search, and, in some situations, can be inferior (see, for example, [6, 12, 4]). Furthermore, document clustering algorithms are often slow, with quadratic running times. It is therefore unsurprising that cluster search, with its indifferent performance, has not gained wide popularity.

Document clustering has also been studied as a method for accelerating near-neighbor search, but the development of fast algorithms for near-neighbor search has decreased interest in this possibility [1].

In this paper, we take a new approach to document clustering. Rather than dismissing document clustering as a poor tool for enhancing near-neighbor search, we ask how it can be effective as an access method in its own right. We describe a document browsing method, called Scatter/Gather, which uses document clustering as its primitive operation. This technique is directed towards information access with non-specific goals and serves as a complement to more focused techniques.

To implement Scatter/Gather, fast document clustering is a necessity. We introduce two new near linear time clustering algorithms which experimentation has shown to be effective, and also discuss reasons for their effectiveness.

1.1 Browsing vs Search

The standard formulation of the information access problem presumes a query, the user's expression of an information need. The task is then to search a corpus for documents that match this need. However, it is not difficult to imagine a situation in which it is hard, if not impossible, to formulate such a query precisely. For example, the

¹Also known as "vector space" or "similarity" search.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

15th Ann Int'l SIGIR '92/Denmark-6/92
© 1992 ACM 0-89791-523-2/92/0006/0318...$1.50

process search emphasis of the of topic only spectrum. the table of the text. user. articles the assigned during where posted month the text to the of Au1. choice for to a particular be looking of words. at all. the table of the sort However. synonyms rather to use precisely instead. end is a browstng a need It is common from browsing goal defined the user selects a subcollection. the out groups With presented the groups bot- which is refined as he finds the document techniques where tend clusis sub- each become successive more small collection. need is too vague to be described as also lead to specific alternate One can eas- a single q searching By system Even scribe if a topic it may words to in were available. this a partially more about of document user. used in discussion may may than with fail of the topic one or more can be selected the on further examination. Here. A. The if one on the full descriptions. text. We propose information methods an alternative access. “international if some available. anything the general user Initially is presented of document the system of document summaries collection clusters. two examination of interest.g. specific Indeed. rather covers may wish to discover Access spectrum: a particular defined document across starts the with out information collection is a narrowly given someto in number number short further to form tering number to the summaries. goal. navigating The used pear words Even were those used to describe the topic of interest. to Browsing proposed short the or the browsing method. tool will but can of which concerning international browsing describes be iterated documents. again The of them selected as specific as its title. merged search. results documents. is directly found focused the document that cluster-based. search. at the other collection. to scatter one or more groups The system of the groups then applies into are gathered together for a session search: to move the new groups. 
to proan clusters immeinvades leads be used being select documents to a more or on terms In particular. method. words. the user ‘Kuwait’ clusters Scatter/Gather. of the seem example. search search similar for user scribe ipate to find user viced used retrieve [7]. issues: clusters.user may not be familiar a topic with the vocabulary or may approprinot wish to 2 In the the Scatter/Gather basic iteration of the with groups.. search browsing a search other the documents e. not which used to deswitch be used help also the be be serthat With vide outline which In the diately Kuwait. queries Germany considers many to focus on international and ‘Germany’ and ‘oil’ are gathered together. of contents. request. formulate to organize too Scatter/Gather of word-based scattering: reunification. tional q the user wants techniques: to find happened that the application of conven- contents answered may ily a sense of what sections of questions by a more between the index. table-of-contents documents. topic month She need only particular by some documents. spectrum. information topic. content specified of the corpus. 5000 Servzce session. may to the of the of interest. may thus used to de- direct with we propose an information our browsing access method. scatters groups. potentially the from big the forced those are Iraq This we antic- the user may. or has a genwhich table might of be and and is provided Several search The as Appendix issues prevent peruses Suppose month.1 An Illustration a Scatter/Gather of about News session figure. the gives that to specific index. to emphasize example tering. analogy. Standard the of this from view information end access become when toms and therefore Ultimately. subcollection which are again iteration detailed. at any time. search groups until Based or snippet discuss articles and events may For need one such or more word-based. application our inspiration with in mind. and The methods. she selects the These three 319 . 
taking provided question question. out what interested logical gaining structure intensive browsing an overview. smaller. never as near-neighbor component This individual process. of a small into and Based a small presents on these for clusa small ate for commit the but fact thing session learn to user describing himself may not of interest. extraction. is summarized we manually cluster in figure to simplify labels sion based single-word The full ses- passages in one of interest. satisfying summaries study. documents. terms which is in 2. necessarily instead will then may terms. means. example. A glaring enough. is cluster by enumerating individual documents. obvious and corpus. a technology capable and used to assist near-neighbor for clustering from textbook. consists Tzmes This the the access typically a conventional and the If one has a specific which directs simply eral lays out define one question. the user is presented relevant stories initial a set of clusters. of contents. not be known the to the a topic words user. dynamic of text of q q components: which metaphor uses for a Scatter/Gather. words event”. one consults We now describe collection New gust York 1990. not fail be those to use apthe articles a collection directed. an entire sea~ch with more no about the for well the user to a document at one end document. viewing in this groups.

be specified. and reveals as articles. of that of miscellaneous learns taken user about lost 3 Before review cuss Document presenting why existing in the the terminology Throughout of clusters. Scatter/Gather. cluster given on the existence of two faciliwhich needs. to cluster measure to partition this Clustering our document established paper. stories. Buckshot the Fractionation clustering essential more of the We concise makes entire also algorithm careful. a subset level of the articles. to meet number the we our of and dis- stories 2. documents. description containing collection thus being the major international of a coup in Trinidad.New York Trees News Service. as well an easy to generate. be appreciated requirement. August 1990 Education Domestic Iraq kts sports oil Gemany Legal \Gathe// International Stories Deployment Politics Germany Pakistan Africa Markets Oil Hostages Gather + Smaller International ‘/ Stories Trinidad W. stories month. number pairwise a method lar Scatter/Gather ties. tering able which cluster a cluster for group. documents clustering in prior fail the n denotes and algorithms algorithms. g. revealing by the We of suitable ter/Gather. happened ‘Pakistan’ This in other define contains of specific political Africa. hostages one which large user feels her understanding wishes world. of documents.. will for be sufficiently topic defined to first Scatclusit suitthe of corpus the reduced new clusters The articles the is about but of the also eight. and in Kuwait. work. than the and some clusters effects the user stories cluster. and otherwise to give yet a sense of the for many that of detail invasion into the short two online whose offline first for enough descriptions meet the for on the Iraqi now deployment. upon the oil market. less than some an algorithm one must similarity into establish then define of simimeasures appropriately Second. 
Africa Figure security International of Scatter/Gather Lebanon Pakistan Japan 1: Illustration This produce Since these original discussing reduced eight corpus new is then reclustered the on the reduced fly to method for automatically This cluster the summarizing description user that must group must clusters corpus reveal articles have military covering contains a finer corpus. of the ‘Oil’ the invasion deduces The corners which a cluster a number a small The among hostages been separated present U. algorithms is a fast reclustering is another. each document as a 320 . simultaneously. situations international in Pakistan. clustering collection. greater to the partitioning user. of document Numerous tolerable for user interaction the collection document treat all of which clusters a minute). part can within First. Africa S. have been proposed. suitable accuracy adequate. to find other articles out foreign about She selects of these what the is algorithm for the static is presented digest. of the a time documents In order k denotes first and similarity desired a and reclustering we need a large a group number is an essential of documents (e.S.2 Requirements depends since clustering basic iteration.

document this perspective. elongated its simplicity time similarity (group-average Although straggly and share for in (complete-linkage). group. in a document. 2]. similar that to [15] than not the documents of length often are equal with frequency overlap information. of the into titioning the an indiA hierar- partitioning algorithm. used to compute defines similarity methods as well behavior. and collection of thus Some of documents algorithms slow or the value be some function If a word A poputhe is zero. which must Global algoO(n”). algorithms defines One improve cluding. availability and are algorithm its computation common they that proceed These algorithms two the certain agglomerative. document of which clustering or a partition hierarchically a tree. of the and mentioned Other greedy. ilarity fusion.. hierarchical proceed built similarity becomes pair is the including. seeds. similarities agglomeration is the pair some of inter-group is considered best or most criterion. which vidual each chical the build These ering which ment gram). document. size is related collection computational obviated for each set to only Willett prominently. assigned size (number in the As a refinement. presence in which or the Partitional position than studied in the If nature above for a flat have value is one or zero. noteworthy be algorithm. and thus minimum measures. potential of a seed-based large not size (relative egy are largely number reason pursued niques acheive number over partitional in the procedure clustering between groups. characteristics. are considered 321 . a few documents. Numerous clustering single-linkage generally of clusters the (which differ greatest then hierarchical algorithms all pairs exhibits group They when document clusters. cally ally. Willett larity results Two structed. each might of a partitional document. clustering be useful fairly tion the near-neighbor search. 
vectors are normalized the include normalized that that inner the the two have rectangular these algorithms is. sure The tors types) value in the the absence of the does cosine both cosine vectors. considthe pair docudendrosimas one conthe clustering typically popular optimal [1 O]. of cluster clustering clustering of the clusters at each stage.e. it remains of an of documents) required have retrieval partition. a collection teT. single-linkage We present algorithms the hierarchical bounds. must contain by to in- performance of near-neighbor missed. called have a demirogram. length. vec(or has a word the course of selecting an agglomeration. 0( kn). sparse to unit product Dice the partitions. of the word. who a partiwords in the stratto the For this by iteratively and into a node product the fusing of the a single to the number [13].set of words. on to most [5. running times. It is can collecan imthat has suggested the choice types is a flat be defined is then to the closest with. rectangular of clusters time For our application. algorithms typically to the Each represented number component of similarities usefulness. quadratic those into of the nested have of unique their even given lower that sets strive in the corpus. of simi- a number is then of seeds equal partition. by iterainto for in in avis the procedure proved any into can be iterated. sets) tion partitional algorithms. to represent within the that its these of the a hierarchy [8. is a hzerarchzcal as either corpus clustered. ner. Alternative linkage). 13]. word’s occur frequency measure. angle of the vector theoretical strategies. measure. i. from sider erage known forming due space They tively manner. when clusters are large enough to contain a large proportion of the terms in the corpus. of course. to the desired measures Jaccard counts. reflecting document. coefficients. is substantial. otherwise clustering some be closely the has been search related However. been community. between and mea[1 1]. 
have running all pairs limits the times are intrinsically be considered. similarity as other speedup aggregate quadratic algorithms to have an undesirable the chaining clusters. this approach yields less improvement of document which Lastly. From benefits by the of the so far. in some rithms. clusters of the selection partitional a hierarchical each transformed parof documents by recursively in an application of subsets. been applied docto be documents. bound on perfordecomrather also are been global as algotypiGenermanof the occurrence We may of the corresponding use a binary the might value two scale. z This that degree of word documents by sparse words rithms because sharply attain mance. Other which has simply are proceed by choosing. between performance lar similarity of the document cosine computes vectors. pair choosing in that under all pairs document groups They to agglomerate groups they chosen are global a single document agglomerate in a gmedv 2Willett [14] discusses an reverted file approach which can ameliorate this quadratic behavior when a large number of small clusters are desn-ed. word impact overlap choice of the final Each document seed. uments application the with that for fine. agglomerative by contrast. is small algorithms. partition generates of unique since it is desirable For example. these same global. Unfortunately. which one of the similarity of the two of a previous strategies aggressively use techbut the which the Single-linkage each the by the information two partitioning from time desired drawn the maximum any two individuals. algorithm algorithm found measure different One can less qualitative of clustering of document partztton The other recursively on clustering can be coninto clussets.
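The cosine measure discussed in this section compares two documents through the angle between their sparse word-count vectors. The following is a minimal sketch of that measure, not the authors' implementation; the function name `cosine` and the three sample documents are illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    if not a or not b:
        return 0.0
    # Iterate over the smaller vector so the cost tracks its length.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical documents, represented as bags of words.
doc1 = Counter("iraq invades kuwait oil prices rise".split())
doc2 = Counter("oil prices rise after kuwait invasion".split())
doc3 = Counter("germany reunification talks continue".split())
```

Documents with no words in common score zero, which is why sparse representations keep the computation cheap: only shared words contribute to the inner product.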

is the ith word of w% in a. algorithms. IVI.. II denotes that taking better results similarity where function. the which clusteT clustering our small Buckshot to find used may algorithms They however Both which can their are both be thought output assume well. be the set of words. lower case Greek letters wJ1 be used to documents. We thus w highest two sets group Rather appear define weighted (rm(I’). employing qr./3) = (9(4Q))) 9(@))) and ~. of documents.. p) = (pap) = ~p(ci)zp(p). set of unique as a c(a) occurring of length c(a) = {f(uI. centers.4 The trimming m maybe of ]r 1. adap- where w. each document the partition in the collection to a center. Z=l 5 or a document with of the contained group. the 4. a short The cluster and description 119(4~))11’ in which case of the cluster. our and produces to experience damping II . consider p(a). We algorithms clusters this for our purposes. but the of of as rough existence clustering to define of some run for and this slowly. . profile Let can be associated sum of profiles normalized Seed-based 1 Find 2 Assign k centers. a) is the frequency between the of C(O) cosine and pairs c(f?). cs)}j!~ in V and ~(w%. can be completed parameter percentage in time proportional defined be fixed. sets of documents (document will denote sets of document 4A full sort of the similarities IS not requmed. uses builds average subroutine to find it takes subroutine this (see appendix algorithms sets. namely considering frequently in p(I’) the con- g to be component-wise traditional we can square-root component-wise It is useful document logarithm. and then 3 Refine The that result U==P so constructed. . is in some than most consider sense the to the trimmed central documents wo~ds. document groups such p(r) = Similarly. let the frequencies. 
cluster digest profiles g(c(ff)) tw(I’)) form of the can easily the p(a) == (m.4 Definitions a in a collection document.1 Another dual the central of words (or Cluster description Digest of a document sum profile. Then Let c(a) (or corpus) with their V be the C. be a function in the group perhaps Taken as a whole. or may 119(C(Q))II119( 4P))II where inner been g is a monotone product. and the Fractionation initial centers. the topical terms of 17. w) tents puted Ivl used of I’. I’). To measure a let S(CY. monotone particular. algorithm Let use The the this cosine profile measure definition: P(x)).. of the Then a in r let rm(17) be the m documents to 17. the normalized group’s which documents sum profile “contents” lie on the can be extended to r by designed is only ~ (p(r). a whose define of the tmmrned only For every similarity group. the tw(I’). measure into account us call group procedure agglomerative B). sum the m To profile “most solve pm(I’) this for problem. we r deby For each document countfile that words vector occur in that central” S(IX. to be the together. is in fact a cluster to a user of Scatter/Gather. namely can be represented ~~~m(r) and Pm(O This tively = &(r)/llfim(r)N. vector than to “(. Upper case Greek letters wdl denote groups) groups. in pm(r)). cluster over 3 Throughout denote individual this paper. outskirts fine the considering cluster.z) Sometimes is not because a good j(r) m.)” denotes It has norm. let element-wise the similarity us employ functions between In computation as some to 1171. those which of a cluster. in time to describe 0(11’ digest be comsummary + lV\). the Each locally of a document subroutine. CXEI? be the unnormalized sum profile. any cluster documents is largest. 17 is a set of documents.3 in C. it in17 by defining phases: Partitional partitional Clustering clustering algorithms have three Suppose A simple to be the dividuals. S(a. is a set P of k disjoint II = C. and upper case Roman letters on its results k centers.

favors j is a small frequency clustering between 323 .. computed by O < i < k is then an ina : C to k groups a I/p individual the Fractionation tree bottom terminating can be viewed up. achieve choose &). each centers center. The partitions implement refinement The 2 by assigniag also reflect limited.2 Fractionation The cluster groups uals These and when to Fractionation C into subroutine such groups that algorithm N/m buckets is then the finds applied k centers to each by initially buckOnce fined signed The Assigning k centers for those to one simplest have Documents been found. such that Q is Each tion = {%( Z-l)+ l. Rz = each clustered where of these clustering (using computations in nm the cluster reducoccurs Each union are That an asso- in the first Step center of Scatter/Gather.)) = Since gorithm this ent random is not operates if m = O(k) on # n items. occur algorithms simplest is fast but a time-accuracy iterated comprehenapplication and clarify el- of agglomerative groups produces refinement procedure. has rectangular experience generally produce qualitatively 5. Refinement an initial a better there clustering. a to the be aseach and group breaking of a fixed individuals in number is roughly as if they The size m > k. ).. of procedures ements is.. define C’={@. separately pm groups. that hence. criterion. of those algorithm. applies the cluster subroutine that subroutine over fixed to a random apgroups This corpus The is procedure ordering initial thus bucketing encourages at least creates nearby a partition individuals in the to find centers. documents Return clearly merely (of size the cenin runs separate agglomeration. is proportional Suppose sic ordering 5.1 Buckshot The Finding Initial Centers C’ inherits with are an enumeration order broken to p2n on i and into groups application the C’ replacing lexicographic peated of the buckshot time random algorithm clustering of the subroutine. application ciated p is the desired time. with a factor a < II. 
one. That corpus similar Buckshot calls alto differtrials The iteration. sive refinement of the of the documents then treated contained is achieved that attempt P. is the which simultaneously case. .> CG?U}. partition as individuals iteration. of agglomerative to a partition time. in each individual. P of size k. is then into Note and.Buckshot sample placation to find accurate significantly the on-the-fly Scatter/Gather. to the nearest ith can index. is thus this O(nm( algorithm takes 1 +P+P2 is.l. one final determine which Thus time. . partition @8. @. “nearest” (in a sense to be defined all n/m {@z. That C. .. which as three. This on C. center We believe finding and. most ffz.J in re- 5. ordering a better word index could procedure of the Typically medium reflect sorts jth so that an extrinC based comnumterms. running is then are is. . of the collection in G. through to Split.3 Given it into rithms. @. . hence. k centers of this p~(I’. lgj order J. c%. pJ nm. and remain. the the similarity procedure to all the centers. . in these next for the move-to-nearest. if i maximizes individuals. on the same although in our is employed. At the cluster found.@m}m} faster. . is more required can entire is the more Buckshot B={@l. This algorithm the may sample is quite simple.. To of C’ reduced The point reduce To sampling deterministic. branching in each are now process document let r. remain.m}. observe of the O(kn).% (Z-1)+2. Asszgn-to-Nea~est.J: l<i Sri/m. . the pn components further this can jth idea buckets. As with is a tradeoff it is now our initial speed desirable and to refine algoaccuracy. document in C must assigns k groups. @2. overall O(rnn). each document later). to subroutine) in m~ time. The of these into (from were document individof p.. factor. clustering that time +. appropriate by iterations to for of reclustering of the iteration Fractionation partitioning be used establish which corpus. A more repeated Join. 
be efficiently for the computing cost group Let G be a partition be the Ties s(cz17i). . pn/m through at iteration groups running time by The Spin} taking process which the @. online Fractionation However. Fractionation uses successive sized to have one word in common. . the process the terminates remaining j if # n < k. of the cluster centers. terminates be broken The by assigning the entire only iteration lowest set P = {11%}. partition. where when the only the desired P can verted In any kn. running algorithm partitions. . Let into ets separately groups to agglomerate reduction bucket) treated repeated. and based to Centers profiles de- suitable on some centers.2. . ters time a rectangular a small and apply clusters algorithm. on a key mon ber. procedure. repeated produce repeated partitions. . . map as building leaves k roots c=al. word such are constructing for each documents. the primary displayed We the Our tradeoff. individuals but in C are enumerated.
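The Buckshot idea described in this section (cluster a random sample of about sqrt(kn) documents to obtain k centers, then assign every document to its nearest center) can be sketched as below. This is a simplified illustration under stated assumptions: the merge criterion is the cosine between group sum profiles, standing in for the group-average agglomerative subroutine, and the names `agglomerate`, `assign_to_nearest`, and `buckshot` are hypothetical.

```python
import math
import random
from collections import Counter


def profile(text):
    """Square-root damped, unit-length word profile of one document."""
    counts = Counter(text.split())
    if not counts:
        return {}
    norm = math.sqrt(sum(counts.values()))
    return {w: math.sqrt(c) / norm for w, c in counts.items()}


def add(a, b):
    out = dict(a)
    for w, v in b.items():
        out[w] = out.get(w, 0.0) + v
    return out


def cosine(a, b):
    if len(a) > len(b):
        a, b = b, a
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def agglomerate(vectors, k):
    """Greedily merge the most similar pair of groups until k remain.

    Simplification: groups are compared by the cosine of their sum profiles.
    """
    groups = [dict(v) for v in vectors]
    while len(groups) > k:
        pairs = ((i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups)))
        i, j = max(pairs, key=lambda ij: cosine(groups[ij[0]], groups[ij[1]]))
        merged = add(groups[i], groups[j])
        groups = [g for t, g in enumerate(groups) if t not in (i, j)]
        groups.append(merged)
    return groups


def assign_to_nearest(profiles, centers):
    """Label each document with the index of its most similar center."""
    return [max(range(len(centers)), key=lambda c: cosine(p, centers[c]))
            for p in profiles]


def buckshot(docs, k, seed=0):
    rng = random.Random(seed)
    profiles = [profile(d) for d in docs]
    n = len(docs)
    s = min(n, max(k, math.ceil(math.sqrt(k * n))))  # sample size ~ sqrt(kn)
    centers = agglomerate(rng.sample(profiles, s), k)
    return assign_to_nearest(profiles, centers)
```

Because the quadratic subroutine only sees about sqrt(kn) documents, its cost is on the order of kn, and the assignment pass costs k similarity computations per document, which matches the rectangular O(kn) bound the text refers to.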

few steps. running and use the perform and time thus to be too slow even for offline algorithm of refinement operators. the trimmed and we then so as to form indefinitely. affect Not e that procedures running time.2} new Buckshot of 17. of refinement. as a feature. initial of the various give several used We have partition by corpus element is simply two the in of these combinations of implementing used the Scatter/Gather Scatter/Gather under P’ = UG.. O(kn). would some iterations procedure P) not < pk for change criterion only of the split groups such the that cosonably produces minishing By virtue algorithm may in fact worsen the partition clusters can pull even fuzzier. 0 < P < cluster in the its greatest iterated Determining proportional pus.. then p. be interpreted of discarding reclustering. split I’). clustering. A(rk)}. to find corpora a quadratic to II’. We W. union G. A(r. never their lists Nearest process merges algorithm separated similar. in common. using the is is in fact documents proportional average as to the centroid. P) r). the to then clusters. to run accuracy. to the as well cluster only s(I’. to decide number clusters is typically to merge. clusters into which to iterate The two Split well the Assign-tosepaparts Join The purpose of the groups elements documents words may by their Join refinement digests.. the partition might Scatter/Gather. of this is the procedure on some cluster in the would coherency self similarity to the cluster. pletely Hence. well overlap. does to N.. of the Assign-to-Nearest improvement. We for the cluseven at the the that yield quickly Bucka two a readithis it with as quickly use follow found be the shot bare ). it makes is typically can be iterated gains only in the first a small fixed though and hence of times. sum profiles to the process nearest we generate above. however. The Let P={ resulting Fl.5 > p.. since ‘{fuzzy” elongated other clusters and become the algorithm minism then in favor be deterministic. 
is to merge usefully have of be by definition.. number t~ (I’) r and is the list A if I“(I’. of Buckshot the overall to N... the k 6 ment the Application procedures course The to Scatter/Gather initial possible clustering complete and refinein clustering method. Therefore groups Since.. ) in the set score criterion is available in advance. = {17. procedure..1. therefore and We and then have We thus centers. center ItU(r) n tu(A)l of w most A) the words number In large is thus topical some for of words each words the for 17. I’z. 1’. the overall it is vital as possible. The i-(r. contemthan the that user 5Excessive Iteration improving away from rather than documents plated that application. This modification since the can order algorithm in time herence be computed proportional of the Buckshot is not deterministic. {A(r. determined when the the initial clustering partition. The G. at high Indeed. two between are not will Iterated The also From using each Assign-to-Nearest procedure first of our mentioned refinement above can Assign-to-Nearest be seen a given document This as the “topical” the criterion A will algorithms. for t=l Each application proportional that quantity between similarity define: = s(r. . corpus offline. one can use of of algo- groups. Hence. the partition algorithm We can therefore accuracy consisting time computation. simirequires time proportional compute a slower the tens rithm and Split. Fractionation a great deal of thousands is likely then Join. in time A groups One This larity average We thus A(r) Let r(f’. running groups. for P.The rates and simplest poorly Join process just defined is simply discussed. 1. where merge time which A) = 17 and set of clusters. procedure in the speed the lack since However. Assign-to-Nearest does not session. modification simple computation can be performed to improve initial However.. the O(k N) tering expense rank of A(I’.. 
less than of Join time Split Split two divides new each document This can (without Buckshot J7~} and partition of the group !J in a partition with P into words the the number of documents. has the option of a fresh an unrevealing 324 . clustering be accomplished refinement) partition let by applying C = r and the be a G provides Buckshot two two P’ new k = 2. of documents. poorly criterion.). clusters are too document distinguished any two “typical” in a partition cluster of P are disjoint. be computed it is more important of deterpartition it. of a document for each of the refinement In an interactive algorithm of some finding center minimum accurate additional returns. center procedure that but finding further with refinement O < p s 1. cluster centers assign new of distinguishability T(r. and topical takes corof and we must compute k2 intersections corpora. ‘s: partition Combinations algorithms. is comconsideration. operator P that they However.
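The Join refinement described here merges groups whose lists of topical words overlap too much to be usefully distinguished. A minimal sketch follows, assuming that topical words are simply the w most frequent words of a group and that two groups are merged when they share more than p of them; the threshold semantics and the function names are assumptions made for illustration.

```python
from collections import Counter


def topical_words(group, w):
    """The w most frequent words in a group of documents (ties by spelling)."""
    totals = Counter()
    for doc in group:
        totals.update(doc.split())
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:w]}


def join(partition, w=3, p=1):
    """Merge groups until no two share more than p of their top-w words."""
    groups = [list(g) for g in partition]
    changed = True
    while changed:
        changed = False
        tops = [topical_words(g, w) for g in groups]
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if len(tops[i] & tops[j]) > p:
                    groups[i] = groups[i] + groups[j]
                    del groups[j]
                    changed = True
                    break
            if changed:
                break
    return groups
```

Comparing short word lists rather than full profiles keeps each pairwise check cheap, so the k² intersections mentioned in the text stay small next to the cost of clustering itself.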

note actual not that if we have some pair Then same each titles bucket. when to any other we finish. in and experience it is difficult that it is indeed helpful in sitmust to specify easy to use. in figure issues). the Buckshot international create clustering for two the goal applied in section certainly see this. Scatter/Gather which Claims undesirable formally. we iterate. Conclusion demonstrates information metaphor contenks out. of one of the will include a clusters Thus is described The number near frequent oil) issues) and first clusters. For extremely 6. Each its Fractionation cluster the is recommended with line the each cluster. of the cluster Fractionation to develop linear time scale that be too under We are working It is worth when points. then algorithms clusters i. algorithms by the motivation by the quality highly running of the clustering clustering algorithms For Buckshot. find with with k = 20. namely is at most a. contains within than this will each actual k documents pair ever will the cluster. of documents corpus the globally. and probably hostages Africa. separated of size & centers will To size centers. will the Buckshot which be feasible. display number parti(figtime of of and line in to Scatter/Gather Times session SeTvice articles. of news this Thus that our in our we start resulting case. smallest the largest our the performance of well has k natural similarity ations large on Scatter/Gather always accuracy is affected slow to find their to arbitrarily the assumption If the processing Clearly. detail more titles international have been war issues) 7 incidents so on. pair. any to get A A Scatter/Gather Session probability k times a sample some none k(l of size . we fail is at most individual the probability of cluster of failure for s individuals So. the New of the The corYork 1990. we have clusters. no pair found will of documents be a subset be merged. and will with partition. Therefore. 
of them is necessarily we need merely in in the be merged a single same of documents words clusters cluster. document the clustering an in- We obtain viewing (figure the situation contained in Liberia in that access tool has shown is particularly or in its own right. permitting. in the We select including international (figure 2 (Iraq’s as those recluster. other Specific and clusters and separated police by 3 (Pakistan. center more clearly.1 Naturally examining data set input Clustered the consists data Data of our separated clusters. whatever high probability as n >> k in k. is to learn month. of Kuwait).The described factor overall complexity section of both is clearly clustering O(klV). then of articles of roughly (1 – I/k)’. 3).s. if we choose from that In figures pus This about updates Here events a = 5 means in 1000. cluster digest. we present described during Some the distributed month are 30 megabytes articles about To the full by output 2. action that give. stories. cluster cluster actual of centers cluster. both is larger of the than of achieved may corpora. of improved performance 325 . Clustering to deal by with the slow. Fractionaprovides al- similarity. Scatter/Gather can be an effective The tuitive uations a query table-of. articles gest Next. larger To support are essential. invasion which in the cluster. procedures The constant enough await fined excels. The second For Fractionation. but procedure one which is still has a somewhat acceptable be done groups entire even or will constant for offline manner on small large corpora.e.. of interest. and probably seem likely 5 (Marother dito contain in preference Thus. the failure of August of ASCII repeated some is at most – l/k) Gklnk < kl-a. in a local than tering rithms trying Scatter/Gather. text due political initial algorithm this line the task. the probability k(l total If we now is a member probability take s = aklnk Z. 4 (African of documents the centroid. cluster. 6 (Germany. 
linear applications. from that Given tion is the News consists 5000 our during 2 through set 5. We find in Trinidad. be true compute This case in which that. tion provided intra-cluster inter-cluster will equal select document document find this Buckshot subroutine. taking one element each we’ve 400 samples all the clusters at least will set the 999 times be a subset found ure 2). a corpus documents so long containing a random from each k widely sample of the then if we have some for the our further gorithms. and display this time a new cluster selecting of some one of the 4. about in Liberia.1. in the cluster. evaluation information metrics access appropriate goals in to the which vaguely de- in this the Scatter/Gather for Buckshot-based use with procedure large is small to permit The interactive document collections. contains kets. method of the articles cluster basis. accurate time and This may be. can the fast clustering quickly algorithms by working rather time clusalgovaripre- Fractionation-based factor. This k = 20 or so. in South the 5). of our – I/k)’.
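The sampling argument in this section can be checked numerically. The sketch below evaluates the bound k(1 - 1/k)^s on the probability that a random sample of size s misses at least one of k clusters, and compares it with k^(1-a) when s = a k ln k. It assumes, as a simplification, that every cluster holds at least a 1/k share of the corpus and that draws are independent; the function names and the example values of k and a are illustrative only.

```python
import math


def miss_probability_bound(k: int, s: int) -> float:
    """Union bound on P(some cluster receives no sample point).

    One cluster is missed with probability at most (1 - 1/k)**s,
    so some cluster is missed with probability at most k * (1 - 1/k)**s.
    """
    return k * (1.0 - 1.0 / k) ** s


def sample_size(k: int, a: float) -> int:
    """Sample size s = a * k * ln(k), rounded up to an integer."""
    return math.ceil(a * k * math.log(k))


if __name__ == "__main__":
    k, a = 20, 2.0  # illustrative values, not taken from the paper
    s = sample_size(k, a)
    print("sample size:", s)
    print("bound on failure probability:", miss_probability_bound(k, s))
    print("k ** (1 - a):", k ** (1 - a))
```

The inequality holds because (1 - 1/k)^k < 1/e, so (1 - 1/k)^(a k ln k) < k^(-a).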

Figure 2: Initial Scattering. [Transcript of > (time (setq first (outline (all-dots tdb)))): 4970 items are scattered into eight clusters, numbered 0 to 7, of sizes 287, 1731, 749, 275, 481, 844, 310, and 293. Each cluster is listed with its topical words and typical titles.]

Figure 3: Second Scatter. [Transcript of > (time (setq second (outline first 2 5 6))): the 1903 items of clusters 2, 5, and 6 from Figure 2 are rescattered into eight clusters of sizes 650, 66, 242, 117, 59, 57, 586, and 126.]

Figure 4: Third Scatter. [Transcript of > (time (setq third (outline second 3 4))): the 176 items of clusters 3 and 4 from Figure 3 are rescattered into eight clusters of sizes 5, 16, 28, 1, 51, 7, 55, and 13.]

Figure 5: Titles of articles in topic 1 from Figure 4. [Transcript of > (print-titles (nth 1 third)): a list of article numbers and headlines concerning Liberia, for example "3720 REBEL LEADER SEIZES ABOUT A DOZEN FOREIGNERS".]

B Group Average Agglomerative Clustering

Here we present a quadratic time greedy global algorithm for group average agglomerative clustering which has given good results in our implementation (cf. [3]).

Let $\Gamma$ be a document group. The average similarity between any two documents in $\Gamma$ is defined to be

$$S(\Gamma) = \frac{1}{|\Gamma|(|\Gamma|-1)} \sum_{\alpha \in \Gamma} \sum_{\beta \in \Gamma,\, \beta \neq \alpha} s(\alpha, \beta).$$

Let $G$ be a set of disjoint document groups. The basic iteration of group average agglomerative clustering finds the two different groups $\Gamma'$ and $\Delta'$ which maximize $S(\Gamma \cup \Delta)$ over all choices from $G$. A new, smaller partition $G'$ is then constructed by merging $\Gamma'$ with $\Delta'$:

$$G' = (G - \{\Gamma', \Delta'\}) \cup \{\Gamma' \cup \Delta'\}.$$

Initially, $G$ is simply a set of singleton groups, one for each individual to be clustered. The iteration terminates (or is truncated) when $|G'| = k$. Note that the output from this procedure is the final flat partition, rather than a nested hierarchy of partitions, although the latter could be computed by recording each pairwise join as one level in a dendrogram.

If we employ the cosine similarity measure, the inner maximization can be significantly accelerated. Recall that $\hat{p}(\Gamma)$ is the unnormalized sum profile associated with $\Gamma$. Then the average pairwise similarity, $S(\Gamma)$, is simply related to the inner product $\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle$. That is, since each normalized profile has unit length,

$$\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle = \sum_{\alpha \in \Gamma} \sum_{\beta \in \Gamma} \langle p(\alpha), p(\beta) \rangle = |\Gamma|(|\Gamma|-1)\, S(\Gamma) + |\Gamma|,$$

so that

$$S(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma|(|\Gamma|-1)}.$$

Similarly, for the union of two disjoint groups, $\Lambda = \Gamma \cup \Delta$,

$$S(\Lambda) = \frac{\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle - (|\Gamma| + |\Delta|)}{(|\Gamma| + |\Delta|)(|\Gamma| + |\Delta| - 1)},$$

where

$$\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle.$$

Therefore, if $S(\Gamma)$ and $\hat{p}(\Gamma)$ are known for every $\Gamma \in G$, the pairwise merge that will produce the least decrease in average similarity can be cheaply computed, and these quantities can be cheaply updated each time a merge is performed. Further, suppose that for every $\Gamma \in G$ a $\Delta$ were known such that $S(\Gamma \cup \Delta)$ is maximized; then finding the best pair would simply involve scanning the $|G|$ candidates. Updating these quantities with each iteration is straightforward, since only those involving $\Gamma'$ and $\Delta'$ need be recomputed. Using techniques such as these, it can be seen that the time complexity of group average agglomerative clustering is $O(n^2)$, where $n$ is equal to the number of individuals to be clustered.
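The procedure of this appendix can be sketched as follows. The code keeps, for every group, its member list and its unnormalized sum profile, and scores a candidate merge using only inner products of sum profiles. It assumes documents are unit-length sparse vectors (term to weight dictionaries); the names `merged_avg_sim` and `group_average_cluster` are illustrative. For brevity this version rescans all pairs at every step, so it is slower than the quadratic-time variant described in the text, which also caches each group's best partner.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def dot(u, v):
    """Inner product of two sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def add(u, v):
    """Sum of two sparse vectors (the sum profile of a merged group)."""
    out = dict(u)
    for t, w in v.items():
        out[t] = out.get(t, 0.0) + w
    return out


def merged_avg_sim(p_g, n_g, p_d, n_d):
    """Average pairwise similarity of the union of two disjoint groups.

    p_g, p_d are sum profiles of unit-length documents; n_g, n_d are sizes.
    """
    n = n_g + n_d
    inner = dot(p_g, p_g) + 2.0 * dot(p_g, p_d) + dot(p_d, p_d)
    return (inner - n) / (n * (n - 1))


def group_average_cluster(docs, k):
    """Greedy group average agglomeration, truncated at k groups."""
    groups = [([i], dict(d)) for i, d in enumerate(docs)]
    while len(groups) > max(k, 1):
        # Pick the pair whose union has the highest average similarity.
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda ij: merged_avg_sim(
                groups[ij[0]][1], len(groups[ij[0]][0]),
                groups[ij[1]][1], len(groups[ij[1]][0])),
        )
        merged = (groups[i][0] + groups[j][0], add(groups[i][1], groups[j][1]))
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
    return [sorted(members) for members, _ in groups]
```

Only the member lists are returned, matching the flat partition described above; recording each merge as it happens would yield the dendrogram instead.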

References

[1] Chris Buckley and Alan F. Lewit. Optimizations of inverted vector searches. In Proceedings of the Eighth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 97-110, 1985.

[2] W. B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28:341-344, 1977.

[3] A. El-Hamdouchi and P. Willett. Hierarchical document clustering using Ward's method. In Proceedings of the Ninth International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 149-156, 1986.

[4] A. Griffiths, H. C. Luckhurst, and P. Willett. Using inter-document similarity information in document retrieval systems. Journal of the American Society for Information Science, 37:3-11, 1986.

[5] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, N.J. 07632, 1988.

[6] N. Jardine and C. J. van Rijsbergen. The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7:217-240, 1971.

[7] J. O. Pedersen, D. Cutting, and J. W. Tukey. Snippet search: a single phrase approach to text access. In Proceedings of the 1991 Joint Statistical Meetings. American Statistical Association, 1991. Also available as Xerox PARC technical report SSL-91-08.

[8] G. Salton. The SMART Retrieval System. Prentice-Hall, Englewood Cliffs, N.J., 1971.

[9] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[10] R. Sibson. SLINK: an optimally efficient algorithm for the single link cluster method. Computer Journal, 16:30-34, 1973.

[11] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.

[12] C. J. van Rijsbergen and W. B. Croft. Document clustering: An evaluation of some experiments with the Cranfield 1400 collection. Information Processing & Management, 11:171-182, 1975.

[13] P. Willett. Document clustering using an inverted file approach. Journal of Information Science, 2:223-231, 1980.

[14] P. Willett. A fast procedure for the calculation of similarity coefficients in automatic classification. Information Processing and Management, 17:53-60, 1981.

[15] P. Willett. Recent trends in hierarchical document clustering: A critical review. Information Processing & Management, 24(5):577-597, 1988.
