Variants of the inverted index can also work for real-valued data.

Once the canopies are built using the approximate distance measure, the second stage completes the clustering by running a standard clustering algorithm using a rigorous, and thus more expensive, distance metric. However, significant computation is saved by eliminating all of the distance comparisons among points that do not fall within a common canopy. Unlike hard-partitioning schemes such as "blocking" and KD-trees, the overall algorithm is tolerant to inaccuracies in the approximate distance measure used to create the canopies, because the canopies may overlap with each other.

From a theoretical standpoint, if we can guarantee certain properties of the inexpensive distance metric, we can show that the original, accurate clustering solution can still be recovered with the canopies approach. In other words, if the inexpensive clustering does not exclude the solution for the expensive clustering, then we will not lose any clustering accuracy. In practice, we have actually found small accuracy increases due to the combined usage of two distance metrics.

Clustering based on canopies can be applied to many different underlying clustering algorithms, including Greedy Agglomerative Clustering, K-means, and Expectation-Maximization.

This paper presents experimental results in which we apply the canopies method with Greedy Agglomerative Clustering to the problem of clustering bibliographic citations from the reference sections of computer science research papers.
The research paper search engine has gathered over a million such references referring to about a hundred thousand unique papers; each reference is a text segment existing in a vocabulary of about a hundred thousand dimensions. Multiple citations to the same paper differ in many ways, particularly in the way author, editor, and journal names are abbreviated, formatted, and ordered. Our goal is to cluster these citations into sets that each refer to the same article. Measuring accuracy on a hand-clustered subset consisting of about 2,000 references, we find that the canopies approach speeds the clustering by an order of magnitude while also providing a modest improvement in clustering accuracy. On the full data set we expect the speedup to be five orders of magnitude.
2. EFFICIENT CLUSTERING WITH CANOPIES
The key idea of the canopy algorithm is that one can greatly reduce the number of distance computations required for clustering by first cheaply partitioning the data into overlapping subsets, and then only measuring distances among pairs of data points that belong to a common subset. The canopies technique thus uses two different sources of information to cluster items: a cheap and approximate similarity measure (e.g., for household addresses, the proportion of words in common between two addresses) and a more expensive and accurate similarity measure (e.g., detailed field-by-field string edit distance measured with tuned transformation costs and dynamic programming).

We divide the clustering process into two stages. In the first stage we use the cheap distance measure in order to create some number of overlapping subsets, called "canopies." A canopy is simply a subset of the elements (i.e., data points or items) that, according to the approximate similarity measure, are within some distance threshold from a central point. Significantly, an element may appear under more than one canopy, and every element must appear in at least one canopy. Canopies are created with the intention that points not appearing in any common canopy are far enough apart that they could not possibly be in the same cluster. Since the distance measure used to create canopies is approximate, there may not be a guarantee of this property, but by allowing canopies to overlap with each other, by choosing a large enough distance threshold, and by understanding the properties of the approximate distance measure, we can have a guarantee in some cases.

The circles with solid outlines in Figure 1 show an example of overlapping canopies that cover a data set. The method by which canopies such as these may be created is described in the next subsection.
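As a concrete illustration of the first stage (the exact procedure we use appears in the next subsection), the following minimal Python sketch builds canopies greedily using two thresholds t1 > t2 under a cheap word-overlap distance. The function names, the Jaccard-style distance, and the threshold parameters here are illustrative choices, not part of the original presentation.

    import random

    def cheap_dist(a, b):
        """Cheap approximate distance: one minus the fraction of
        words the two strings share (cf. the address example)."""
        wa, wb = set(a.split()), set(b.split())
        return 1.0 - len(wa & wb) / max(len(wa | wb), 1)

    def make_canopies(points, dist, t1, t2):
        """Build overlapping canopies greedily.  t1 > t2: points
        within the loose threshold t1 of a randomly chosen center
        join its canopy; points within the tight threshold t2 are
        additionally removed from the pool of future centers and
        members, so overlap comes from points between t2 and t1."""
        assert t1 > t2
        remaining = set(range(len(points)))
        canopies = []
        while remaining:
            center = random.choice(tuple(remaining))
            dists = {i: dist(points[center], points[i])
                     for i in remaining}
            # everything loosely close to the center joins the canopy
            canopies.append({i for i, d in dists.items() if d < t1})
            # everything tightly close (including the center itself,
            # at distance 0) is removed from further consideration
            remaining -= {i for i, d in dists.items() if d < t2}
        return canopies

Every point ends up in at least one canopy, since a point is removed from the pool only when it is within t2 (and hence within t1) of some center.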
In the second stage, we execute some traditional clustering algorithm, such as Greedy Agglomerative Clustering or K-means, using the accurate distance measure, but with the restriction that we do not calculate the distance between two points that never appear in the same canopy; that is, we assume their distance to be infinite. For example, if all items are trivially placed into a single canopy, then the second round of clustering degenerates to traditional, unconstrained clustering with an expensive distance metric. If, however, the canopies are not too large and do not overlap too much, then a large number of expensive distance measurements will be avoided, and the amount of computation required for clustering will be greatly reduced. Furthermore, if the constraints on the clustering imposed by the canopies still include the traditional clustering solution among the possibilities, then the canopies procedure may not lose any clustering accuracy, while still increasing computational efficiency significantly.

We can state more formally the conditions under which the canopies procedure will perfectly preserve the results of traditional clustering. If the underlying traditional clusterer is K-means, Expectation-Maximization, or Greedy Agglomerative Clustering in which distance to a cluster is measured to the centroid of the cluster, then clustering accuracy will be preserved exactly when:

    For every traditional cluster, there exists a canopy such that
    all elements of the cluster are in the canopy.

If instead we perform Greedy Agglomerative Clustering in which distance to a cluster is measured to the closest point in the cluster, then clustering accuracy will be preserved exactly when:

    For every cluster, there exists a set of canopies such that the
    elements of the cluster "connect" the canopies.
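To make the restriction concrete, here is a minimal sketch (again illustrative Python, not the procedure used in our experiments) of a thresholded single-link variant of Greedy Agglomerative Clustering that only ever evaluates the expensive metric on pairs sharing a canopy; the names co_canopy_pairs, dist, and threshold are our own.

    from itertools import combinations

    def co_canopy_pairs(canopies):
        """Yield each pair of point indices that shares at least
        one canopy; all other pairs are implicitly at infinite
        distance and are simply never generated."""
        seen = set()
        for canopy in canopies:
            for pair in combinations(sorted(canopy), 2):
                if pair not in seen:
                    seen.add(pair)
                    yield pair

    def single_link_clusters(points, canopies, dist, threshold):
        """Thresholded single-link clustering under the canopy
        restriction, via union-find: merge two clusters whenever
        some co-canopy pair is closer than `threshold` under the
        expensive metric."""
        parent = list(range(len(points)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i
        for i, j in co_canopy_pairs(canopies):
            if dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)
        clusters = {}
        for i in range(len(points)):
            clusters.setdefault(find(i), []).append(i)
        return list(clusters.values())

Note how this sketch mirrors the second condition above: because merges can chain through points that two canopies share, a traditional cluster is recoverable exactly when its elements "connect" the canopies they fall in.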