Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
Standard view
Full view
of .
Look up keyword
Like this
0 of .
Results for:
No results containing your search query
P. 1
Canopy Kdd00

Canopy Kdd00

Ratings: (0)|Views: 22|Likes:
Published by Shanti Prasad

More info:

Published by: Shanti Prasad on Dec 23, 2009
Copyright:Attribution Non-commercial


Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less





Efficient ClusteringofHigh-DimensionalData Setswith Applicationto ReferenceMatching
Andrew McCallum
WhizBang! Labs - Research4616 Henry StreetPittsburgh, PA USA
mccallum@cs.cmu.eduKamal Nigam
School of Computer ScienceCarnegie Mellon UniversityPittsburgh, PA USA
knigam@cs.cmu.eduLyle H. Ungar
Computer and Info. ScienceUniversity of PennsylvaniaPhiladelphia, PA USA
Many important problems involve clustering large datasets.Although naive implementations of clustering are computa-tionally expensive, there are established efficient techniquesfor clustering when the dataset has either (1) a limited num-ber of clusters, (2) a low feature dimensionality, or (3) asmall number of data points. However, there has been muchless work on methods of efficiently clustering datasets thatare large in all three ways at once—for example, havingmillions of data points that exist in many thousands of di-mensions representing many thousands of clusters.We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap,approximate distance measure to efficiently divide the datainto overlapping subsets we call
. Then cluster-ing is performed by measuring exact distances only betweenpoints that occur in a common canopy. Using canopies, largeclustering problems that were formerly impossible becomepractical. Under reasonable assumptions about the cheapdistance metric, this reduction in computational cost comeswithout any loss in clustering accuracy. Canopies can beapplied to many domains and used with a variety of cluster-ing approaches, including Greedy Agglomerative Clustering,K-means and Expectation-Maximization. We present ex-perimental results on grouping bibliographic citations fromthe reference sections of research papers. Here the canopyapproach reduces computation time over a traditional clus-tering approach by more than an order of magnitude anddecreases error in comparison to a previously used algorithmby 25%.
Categories and Subject Descriptors
Pattern Recognition
]: Clustering; H.3.3[
InformationStorage andRetrieval
]: Information Search and Retrieval—
Unsupervised clustering techniques have been applied tomany important problems. By clustering patient records,health care trends are discovered. By clustering addresslists, duplicate entries are eliminated. By clustering astro-nomical data, new classes of stars are identified. By cluster-ing documents, hierarchical organizations of information arederived. To address these applications, and many others, avariety of clustering algorithms have been developed.However, traditional clustering algorithms become compu-tationally expensive when the data set to be clustered islarge. There are three different ways in which the data setcan be large: (1) there can be a large number of elements inthe data set, (2) each element can have many features, and(3) there can be many clusters to discover. Recent advancesin clustering algorithms have addressed these efficiency is-sues, but only partially. For example, KD-trees [15] providefor efficient EM-style clustering of many elements, but re-quire that the dimensionality of each element be small. An-other algorithm [3] efficiently performs K-means clusteringby finding good initial starting points, but is not efficientwhen the number of clusters is large. There has been al-most no work on algorithms that work efficiently when thedata set is large in all three senses at once—when there aremillions of elements, many thousands of features, and manythousands of clusters.This paper introduces a technique for clustering that is effi-cient when the problem is large in all of these three ways atonce. The key idea is to perform clustering in two stages,first a rough and quick stage that divides the data into over-lapping subsets we call “canopies,” then a more rigorousfinal stage in which expensive distance measurements areonly made among points that occur in a common canopy.This differs from previous clustering methods in that it usestwo different distance metrics for the two stages, and formsoverlapping regions.The first stage can make use of extremely inexpensive meth-ods for finding data elements near the center of a region.Many proximity measurement methods, such as the invertedindex commonly used in information retrieval systems, arevery efficient in high dimensions and can find elements nearthe query by examining only a small fraction of a data set.
Variants of the inverted index can also work for real-valueddata.Once the canopies are built using the approximate distancemeasure, the second stage completes the clustering by run-ning a standard clustering algorithm using a rigorous, andthus more expensive distance metric. However, significantcomputation is saved by eliminating all of the distance com-parisons among points that do not fall within a commoncanopy. Unlike hard-partitioning schemes such as “block-ing” [13] and KD-trees, the overall algorithm is tolerant toinaccuracies in the approximate distance measure used tocreate the canopies because the canopies may overlap witheach other.From a theoretical standpoint, if we can guarantee certainproperties of the inexpensive distance metric, we can showthat the original, accurate clustering solution can still berecovered with the canopies approach. In other words, if the inexpensive clustering does not exclude the solution forthe expensive clustering, then we will not lose any clusteringaccuracy. In practice, we have actually found small accuracy
due to the combined usage of two distance metrics.Clustering based on canopies can be applied to many differ-ent underlying clustering algorithms, including Greedy Ag-glomerative Clustering, K-means, and Expectation-Maxim-ization.This paper presents experimental results in which we applythe canopies method with Greedy Agglomerative Clusteringto the problem of clustering bibliographic citations from thereference sections of computer science research papers. The
research paper search engine [11] has gathered over amillion such references referring to about a hundred thou-sand unique papers; each reference is a text segment exist-ing in a vocabulary of about hundred thousand dimensions.Multiple citations to the same paper differ in many ways,particularly in the way author, editor and journal names areabbreviated, formatted and ordered. Our goal is to clusterthese citations into sets that each refer to the same article.Measuring accuracy on a hand-clustered subset consisting of about 2000 references, we find that the canopies approachspeeds the clustering by an order of magnitude while alsoproviding a modest improvement in clustering accuracy. Onthe full data set we expect the speedup to be five orders of magnitude.
The key idea of the canopy algorithm is that one can greatlyreduce the number of distance computations required forclustering by first cheaply partitioning the data into over-lapping subsets, and then only measuring distances amongpairs of data points that belong to a common subset.The canopies technique thus uses two different sources of information to cluster items: a cheap and approximate sim-ilarity measure (e.g., for household address, the proportionof words in common between two address) and a more ex-pensive and accurate similarity measure (e.g., detailed field-by-field string edit distance measured with tuned transfor-mation costs and dynamic programming).We divide the clustering process into two stages. In the firststage we use the cheap distance measure in order to cre-ate some number of overlapping subsets, called “canopies.”A canopy is simply a subset of the elements (
datapoints or items) that, according to the approximate simi-larity measure, are within some distance threshold from acentral point. Significantly, an element may appear undermore than one canopy, and every element must appear in atleast one canopy. Canopies are created with the intentionthat points not appearing in any common canopy are farenough apart that they could not possibly be in the samecluster. Since the distance measure used to create canopiesis approximate, there may not be a guarantee of this prop-erty, but by allowing canopies to overlap with each other, bychoosing a large enough distance threshold, and by under-standing the properties of the approximate distance mea-sure, we can have a guarantee in some cases.The circles with solid outlines in Figure 1 show an exampleof overlapping canopies that cover a data set. The methodby which canopies such as these may be created is describedin the next subsection.In the second stage, we execute some traditional cluster-ing algorithm, such as Greedy Agglomerative Clustering orK-means using the accurate distance measure, but with therestriction that we do not calculate the distance between twopoints that never appear in the same canopy,
we assumetheir distance to be infinite. For example, if all items aretrivially placed into a single canopy, then the second roundof clustering degenerates to traditional, unconstrained clus-tering with an expensive distance metric. If, however, thecanopies are not too large and do not overlap too much,then a large number of expensive distance measurementswill be avoided, and the amount of computation requiredfor clustering will be greatly reduced. Furthermore, if theconstraints on the clustering imposed by the canopies still in-clude the traditional clustering solution among the possibil-ities, then the canopies procedure may not lose any cluster-ing accuracy, while still increasing computational efficiencysignificantly.We can state more formally the conditions under which thecanopies procedure will perfectly preserve the results of tra-ditional clustering. If the underlying traditional clusterer isK-means, Expectation-Maximization or Greedy Agglomer-ative Clustering in which distance to a cluster is measuredto the centroid of the cluster, then clustering accuracy willbe preserved exactly when:For every traditional cluster, there exists a canopysuch that all elements of the cluster are in thecanopy.If instead weperform Greedy Agglomerative Clustering clus-tering in which distance to a cluster is measure to the closestpoint in the cluster, then clustering accuracy will be pre-served exactly when:For every cluster, there exists a set of canopiessuch that the elements of the cluster “connect”the canopies.
Figure 1: An example of four data clusters and thecanopies that cover them. Points belonging to thesame cluster are colored in the same shade of gray.The canopies were created by the method outlinedin section 2.1. Point A was selected at randomand forms a canopy consisting of all points withinthe outer (solid) threshold. Points inside the in-ner (dashed) threshold are excluded from being thecenter of, and forming new canopies. Canopies forB, C, D, and E were formed similarly to A. Notethat the optimality condition holds: for each clus-ter there exists at least one canopy that completelycontains that cluster. Note also that while there issome overlap, there are many points excluded byeach canopy. Expensive distance measurements willonly be made between pairs of points in the samecanopies, far fewer than all possible pairs in the dataset.
We have found in practice that it is not difficult to create in-expensive distance measures that nearly always satisfy these“canopy properties.”
2.1 Creating Canopies
In most cases, a user of the canopies technique will be able toleverage domain-specific features in order to design a cheapdistance metric and efficiently create canopies using the met-ric. For example, if the data consist of a large number of hospital patient records including diagnoses, treatments andpayment histories, a cheap measure of similaritybetween thepatients might be “1” if they have a diagnosis in commonand “0” if they do not. In this case canopy creation is triv-ial: people with a common diagnosis fall in the same canopy.(More sophisticated versions could take account of the hier-archical structure of diagnoses such as ICD9 codes, or couldalso include secondary diagnoses.) Note, however, that peo-ple with multiple diagnoses will fall into multiple canopiesand thus the canopies will overlap.Often one or a small number of features suffice to buildcanopies, even if the itemsbeing clustered (e.g. the patients)have thousands of features. For example, bibliographic ci-tations might be clustered with a cheap similarity metricwhich only looks at the last names of the authors and theyear of publication, even though the whole text of the ref-erence and the article are available.At other times, especially when the individual features arenoisy, one may still want to use all of the many features of an item. The following section describes a distance metricand method of creating canopies that often provides goodperformance in such cases. For concreteness, we considerthe case in which documents are the items and words inthe document are the features; the method is also broadlyapplicable to other problems with similar structure.
2.1.1 A Cheap Distance Metric
All the very fast distance metrics for text used by searchengines are based on the inverted index. An inverted indexis a sparse matrix representation in which, for each word,we can directly access the list of documents containing thatword. When we want to find all documents close to a givenquery, we need not explicitly measure the distance to alldocuments in the collection, but need only examine the listof documents associated with each word in the query. Thegreat majority of the documents, which have no words incommon with the query, need never be considered. Thus wecan use an inverted index to efficiently calculate a distancemetric that is based on the number of words two documentshave in common.Given the above distance metric, one can create canopiesas follows. Start with a list of the data points in any order,and with two distance thresholds,
, where
> T 
.(These thresholds can be set by the user, or, as in our ex-periments, selected by cross-validation.) Pick a point off the list and approximately measure its distance to all otherpoints. (This is extremely cheap with an inverted index.)Put all points that are within distance threshold
into acanopy. Remove from the list all points that are within dis-tance threshold
. Repeat until the list is empty. Figure 1shows some canopies that were created by this procedure.The idea of an inverted index can also be applied to high-dimensional real-valued data. Each dimension would be dis-cretized into some number of bins, each containing a bal-anced number of data points. Then each data point is ef-fectively turned into a “document” containing “words” con-sisting of the unique bin identifiers for each dimension of thepoint. If one is worried about edge effects at the boundariesbetween bins, we can include in a data point’s document theidentifiers not only of the bin in which the point is found,but also the bins on either side. Then, as above, a cheapdistance measure can be based on the the number of binidentifiers the two points have in common. A similar proce-dure has been used previously with success [8].
2.2 Canopies with Greedy AgglomerativeClustering
Greedy Agglomerative Clustering (GAC) is a common clus-tering technique used to group items together based on sim-ilarity. In standard greedy agglomerative clustering, we aregiven as input a set of items and a means of computing thedistance (or similarity) between any of the pairs of items.

You're Reading a Free Preview

/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->