Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching

Andrew McCallum
WhizBang! Labs - Research, 4616 Henry Street, Pittsburgh, PA USA
mccallum@cs.cmu.edu

Kamal Nigam
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA USA
knigam@cs.cmu.edu

Lyle H. Ungar
Computer and Info. Science, University of Pennsylvania, Philadelphia, PA USA
ungar@cis.upenn.edu

ABSTRACT

Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once; for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters.

We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call "canopies." Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%.

Categories and Subject Descriptors

I.5.3 [Pattern Recognition]: Clustering; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Clustering

1. INTRODUCTION

Unsupervised clustering techniques have been applied to many important problems. By clustering patient records, health care trends are discovered. By clustering address lists, duplicate entries are eliminated. By clustering astronomical data, new classes of stars are identified. By clustering documents, hierarchical organizations of information are derived. To address these applications, and many others, a variety of clustering algorithms have been developed.

However, traditional clustering algorithms become computationally expensive when the data set to be clustered is large. There are three different ways in which the data set can be large: (1) there can be a large number of elements in the data set, (2) each element can have many features, and (3) there can be many clusters to discover. Recent advances in clustering algorithms have addressed these efficiency issues, but only partially. For example, KD-trees [15] provide for efficient EM-style clustering of many elements, but require that the dimensionality of each element be small. Another algorithm [3] efficiently performs K-means clustering by finding good initial starting points, but is not efficient when the number of clusters is large. There has been almost no work on algorithms that work efficiently when the data set is large in all three senses at once: when there are millions of elements, many thousands of features, and many thousands of clusters.

This paper introduces a technique for clustering that is efficient when the problem is large in all of these three ways at once. The key idea is to perform clustering in two stages, first a rough and quick stage that divides the data into overlapping subsets we call "canopies," then a more rigorous final stage in which expensive distance measurements are only made among points that occur in a common canopy. This differs from previous clustering methods in that it uses two different distance metrics for the two stages, and forms overlapping regions.

The first stage can make use of extremely inexpensive methods for finding data elements near the center of a region. Many proximity measurement methods, such as the inverted index commonly used in information retrieval systems, are very efficient in high dimensions and can find elements near the query by examining only a small fraction of a data set. Variants of the inverted index can also work for real-valued data.

Once the canopies are built using the approximate distance measure, the second stage completes the clustering by running a standard clustering algorithm using a rigorous, and thus more expensive, distance metric. However, significant computation is saved by eliminating all of the distance comparisons among points that do not fall within a common canopy. Unlike hard-partitioning schemes such as "blocking" [13] and KD-trees, the overall algorithm is tolerant to inaccuracies in the approximate distance measure used to create the canopies because the canopies may overlap with each other.

From a theoretical standpoint, if we can guarantee certain properties of the inexpensive distance metric, we can show that the original, accurate clustering solution can still be recovered with the canopies approach. In other words, if the inexpensive clustering does not exclude the solution for the expensive clustering, then we will not lose any clustering accuracy. In practice, we have actually found small accuracy increases due to the combined usage of two distance metrics.

Clustering based on canopies can be applied to many different underlying clustering algorithms, including Greedy Agglomerative Clustering, K-means, and Expectation-Maximization.

This paper presents experimental results in which we apply the canopies method with Greedy Agglomerative Clustering to the problem of clustering bibliographic citations from the reference sections of computer science research papers. The Cora research paper search engine [11] has gathered over a million such references referring to about a hundred thousand unique papers; each reference is a text segment existing in a vocabulary of about a hundred thousand dimensions. Multiple citations to the same paper differ in many ways, particularly in the way author, editor and journal names are abbreviated, formatted and ordered. Our goal is to cluster these citations into sets that each refer to the same article. Measuring accuracy on a hand-clustered subset consisting of about 2000 references, we find that the canopies approach speeds the clustering by an order of magnitude while also providing a modest improvement in clustering accuracy. On the full data set we expect the speedup to be five orders of magnitude.

2. EFFICIENT CLUSTERING WITH CANOPIES

The key idea of the canopy algorithm is that one can greatly reduce the number of distance computations required for clustering by first cheaply partitioning the data into overlapping subsets, and then only measuring distances among pairs of data points that belong to a common subset. The canopies technique thus uses two different sources of information to cluster items: a cheap and approximate similarity measure (e.g., for household addresses, the proportion of words in common between two addresses) and a more expensive and accurate similarity measure (e.g., detailed field-by-field string edit distance measured with tuned transformation costs and dynamic programming).

We divide the clustering process into two stages. In the first stage we use the cheap distance measure in order to create some number of overlapping subsets, called "canopies." A canopy is simply a subset of the elements (i.e., data points or items) that, according to the approximate similarity measure, are within some distance threshold from a central point. Significantly, an element may appear under more than one canopy, and every element must appear in at least one canopy. Canopies are created with the intention that points not appearing in any common canopy are far enough apart that they could not possibly be in the same cluster. Since the distance measure used to create canopies is approximate, there may not be a guarantee of this property, but by allowing canopies to overlap with each other, by choosing a large enough distance threshold, and by understanding the properties of the approximate distance measure, we can have a guarantee in some cases. The circles with solid outlines in Figure 1 show an example of overlapping canopies that cover a data set. The method by which canopies such as these may be created is described in the next subsection.

In the second stage, we execute some traditional clustering algorithm, such as Greedy Agglomerative Clustering or K-means, using the accurate distance measure, but with the restriction that we do not calculate the distance between two points that never appear in the same canopy, i.e., we assume their distance to be infinite. For example, if all items are trivially placed into a single canopy, then the second round of clustering degenerates to traditional, unconstrained clustering with an expensive distance metric. If, however, the canopies are not too large and do not overlap too much, then a large number of expensive distance measurements will be avoided, and the amount of computation required for clustering will be greatly reduced. Furthermore, if the constraints on the clustering imposed by the canopies still include the traditional clustering solution among the possibilities, then the canopies procedure may not lose any clustering accuracy, while still increasing computational efficiency significantly.

We can state more formally the conditions under which the canopies procedure will perfectly preserve the results of traditional clustering. If the underlying traditional clusterer is K-means, Expectation-Maximization or Greedy Agglomerative Clustering in which distance to a cluster is measured to the centroid of the cluster, then clustering accuracy will be preserved exactly when:

    For every traditional cluster, there exists a canopy such that all elements of the cluster are in the canopy.

If instead we perform Greedy Agglomerative Clustering in which distance to a cluster is measured to the closest point in the cluster, then clustering accuracy will be preserved exactly when:

    For every cluster, there exists a set of canopies such that the elements of the cluster "connect" the canopies.
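As a sketch of how the canopy restriction prunes work in the second stage, the following Python snippet (not from the paper; the canopy assignment and point indices are invented for illustration) enumerates only those pairs of points that share at least one canopy. All other pairs are treated as infinitely far apart and are simply never compared:

```python
from itertools import combinations

def candidate_pairs(canopies):
    """Collect the pairs of points that share at least one canopy.

    canopies: dict mapping canopy id -> iterable of point indices.
    Expensive distance measurements would be made only for these
    pairs; points that never co-occur in a canopy are never paired.
    """
    pairs = set()
    for members in canopies.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs

# Example: two overlapping canopies over five points (point 2 is shared).
canopies = {0: {0, 1, 2}, 1: {2, 3, 4}}
print(sorted(candidate_pairs(canopies)))
```

Here only 6 of the 10 possible pairs survive (pairs such as (0, 3) share no canopy); with many small, lightly overlapping canopies the savings grow accordingly.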


Figure 1: An example of four data clusters and the canopies that cover them. Points belonging to the same cluster are colored in the same shade of gray. The canopies were created by the method outlined in Section 2.1. Point A was selected at random and forms a canopy consisting of all points within the outer (solid) threshold. Points inside the inner (dashed) threshold are excluded from being the center of, and forming, new canopies. Canopies for B, C, D, and E were formed similarly to A. Note that the optimality condition holds: for each cluster there exists at least one canopy that completely contains that cluster. Note also that while there is some overlap, there are many points excluded by each canopy. Expensive distance measurements will only be made between pairs of points in the same canopies, far fewer than all possible pairs in the data set.

We have found in practice that it is not difficult to create inexpensive distance measures that nearly always satisfy these "canopy properties."

2.1 Creating Canopies

In most cases, a user of the canopies technique will be able to leverage domain-specific features in order to design a cheap distance metric and efficiently create canopies using the metric. For example, if the data consist of a large number of hospital patient records including diagnoses, treatments and payment histories, a cheap measure of similarity between the patients might be "1" if they have a diagnosis in common and "0" if they do not. In this case canopy creation is trivial: people with a common diagnosis fall in the same canopy. (More sophisticated versions could take account of the hierarchical structure of diagnoses such as ICD9 codes, or could also include secondary diagnoses.) Note, however, that people with multiple diagnoses will fall into multiple canopies and thus the canopies will overlap.

Often one or a small number of features suffice to build canopies, even if the items being clustered (e.g., the patients) have thousands of features. For example, bibliographic citations might be clustered with a cheap similarity metric which only looks at the last names of the authors and the year of publication, even though the whole text of the reference and the article are available.

At other times, especially when the individual features are noisy, one may still want to use all of the many features of an item. The following section describes a distance metric and method of creating canopies that often provides good performance in such cases. For concreteness, we consider the case in which documents are the items and words in the document are the features; the method is also broadly applicable to other problems with similar structure.
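The trivial diagnosis-based canopy creation described above can be sketched in a few lines of Python. The patient records and diagnosis codes here are invented for illustration:

```python
from collections import defaultdict

# Hypothetical patient records: patient id -> set of diagnosis codes.
patients = {
    "p1": {"diabetes", "hypertension"},
    "p2": {"diabetes"},
    "p3": {"hypertension"},
    "p4": {"asthma"},
}

# One canopy per diagnosis. A patient with several diagnoses falls
# into several canopies, so the canopies overlap.
canopies = defaultdict(set)
for pid, codes in patients.items():
    for code in codes:
        canopies[code].add(pid)

print(dict(canopies))
```

Note that patient "p1" appears in both the "diabetes" and "hypertension" canopies, illustrating the overlap the text describes.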

2.1.1 A Cheap Distance Metric

All the very fast distance metrics for text used by search engines are based on the inverted index. An inverted index is a sparse matrix representation in which, for each word, we can directly access the list of documents containing that word. When we want to find all documents close to a given query, we need not explicitly measure the distance to all documents in the collection, but need only examine the list of documents associated with each word in the query. The great majority of the documents, which have no words in common with the query, need never be considered. Thus we can use an inverted index to efficiently calculate a distance metric that is based on the number of words two documents have in common.

Given the above distance metric, one can create canopies as follows. Start with a list of the data points in any order, and with two distance thresholds, T1 and T2, where T1 > T2. (These thresholds can be set by the user, or, as in our experiments, selected by cross-validation.) Pick a point off the list and approximately measure its distance to all other points. (This is extremely cheap with an inverted index.) Put all points that are within distance threshold T1 into a canopy. Remove from the list all points that are within distance threshold T2. Repeat until the list is empty. Figure 1 shows some canopies that were created by this procedure.

The idea of an inverted index can also be applied to high-dimensional real-valued data. Each dimension would be discretized into some number of bins, each containing a balanced number of data points. Then each data point is effectively turned into a "document" containing "words" consisting of the unique bin identifiers for each dimension of the point. If one is worried about edge effects at the boundaries between bins, we can include in a data point's document the identifiers not only of the bin in which the point is found, but also the bins on either side. Then, as above, a cheap distance measure can be based on the number of bin identifiers the two points have in common. A similar procedure has been used previously with success [8].

2.2 Canopies with Greedy Agglomerative Clustering

Greedy Agglomerative Clustering (GAC) is a common clustering technique used to group items together based on similarity. In standard greedy agglomerative clustering, we are given as input a set of items and a means of computing the distance (or similarity) between any pair of items.
