
DRAFT: a final version will be posted shortly

COS 424: Interacting with Data

Lecturer: David Blei Scribe: Vaneet Aggarwal Lecture # 9 March 6, 2007

1 Review of Clustering and K-means

In the previous lecture, we saw that clustering automatically segments data into groups of similar points. This is useful for organizing data automatically, for understanding hidden structure in the data, and for representing high-dimensional data in a low-dimensional space. In classification, each data point comes with a label; in clustering, no label information is available. Clustering problems are nevertheless widely solved in practice.

We discussed the k-means algorithm in the previous lecture. In k-means, we first choose initial cluster means, and then repeat two steps, assigning each data point to its closest mean and recomputing the means according to the new assignment, until the assignments no longer change. We also saw an example in which we divided the data into 4 clusters, with the initial cluster means scattered randomly over the plane. The k-means algorithm finds a local minimum of the objective function, which is the sum of the squared distances from each data point to its assigned mean. In the slides, the mean locations in the example are shown as boxes, and the objective function goes down with the iterations.
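The procedure reviewed above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the slides: the function names, the choice to initialize the means at k randomly sampled data points, the `seed` and `max_iter` parameters, and the toy data are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Alternate assignment and mean updates until assignments stop changing.

    Returns the final means, the assignment of each point, and the value of
    the objective (sum of squared distances to assigned means) per iteration.
    """
    rng = random.Random(seed)
    means = rng.sample(points, k)  # assumption: start from k data points
    assignment = None
    history = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest mean.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments did not change, so we have converged
        assignment = new_assignment
        # Update step: recompute each mean from its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = tuple(sum(x) / len(members) for x in zip(*members))
        history.append(
            sum(squared_distance(p, means[a]) for p, a in zip(points, assignment))
        )
    return means, assignment, history


# Toy data (made up for illustration): two loose groups in the unit square.
points = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2),
          (0.9, 0.9), (0.8, 0.9), (0.9, 0.8)]
means, assignment, history = kmeans(points, k=2)
print(history)
```

The printed list is the objective after each iteration. It never increases, because each of the two steps can only lower the objective or leave it unchanged. Different seeds may end in different local minima.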

2 How do we choose the number of clusters k?

Choosing k is a nagging problem in cluster analysis, and there is no agreed-upon solution. Sometimes people just declare it arbitrarily. Sometimes the problem determines k. For an image, we may have a memory constraint that limits k. Alternatively, with no memory constraint, we may have a constraint on the amount of distortion we can accept, which also limits the values of k we can choose. In another example, clustering consumers, the constraint can be the number of salespeople available. We try to choose a natural value for the number of clusters, but in general this notion is not well defined.

Now we discuss what happens when the number of clusters increases. Consider the example in the slides where there are 4 clusters (Figure 1). There are several possibilities for the fifth group:

1. The fifth group is small or empty. In this case, the objective function remains almost the same.
2. The fifth cluster center lies in the center of the figure and draws points from all the other clusters. The objective function decreases, because a point moves to the new mean only if it is closer to it than to its old mean; otherwise no points would move and we would be back in the first case. However, many points remain far from their cluster means.
3. One of the clusters subdivides into two. The objective function decreases because the points of that cluster come closer to their means. This is a better option than the previous two, since more data points are affected than in the second case.

Figure 1: Division of data into four clusters (panel title: OBJ=9.97e+00)

Figure 2: Division of data into five clusters (panel title: OBJ=8.81e+00)

Figure 3: Division of data into six clusters (panel title: OBJ=8.06e+00)

Figure 4: Division of data into seven clusters (panel title: OBJ=7.40e+00)

Figure 5: Division of data into eight clusters (panel title: OBJ=6.33e+00)

We can see the effect of increasing the number of clusters from 4 to 5, 6, 7 and 8 in Figures 2, 3, 4 and 5 respectively. In each case, one of the clusters is subdivided into two. When we plot the objective function against the number of clusters (Figure 6), we find a kink between k=3 and k=5. This is because the decrease in the objective function when k increases from 3 to 4 is much larger than the decrease when k increases from 4 to 5. This suggests that 4 is the right number of clusters. Tibshirani et al. (2001) presented a method for finding this kink.
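The idea of a kink can be made concrete with a small sketch. The heuristic below is a simple illustration and is not the method of Tibshirani et al.: it compares the drop in the objective going into each k with the drop going out of it, and picks the k where that ratio is largest. The function name `find_kink` and the objective values are hypothetical, made up for illustration rather than taken from the lecture's data.

```python
def find_kink(objective_by_k):
    """Return the k whose incoming drop most exceeds its outgoing drop.

    objective_by_k maps a number of clusters k to the k-means objective
    obtained with that k. Only interior values of k can be returned.
    """
    ks = sorted(objective_by_k)
    best_k, best_ratio = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_into = objective_by_k[prev_k] - objective_by_k[k]
        drop_after = objective_by_k[k] - objective_by_k[next_k]
        # Guard against a zero (or negative) drop in the denominator.
        ratio = drop_into / max(drop_after, 1e-12)
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k


# Hypothetical objective values: large drops up to k=4, small ones after.
objectives = {1: 100.0, 2: 60.0, 3: 35.0, 4: 10.0, 5: 9.0, 6: 8.2, 7: 7.5}
kink = find_kink(objectives)
print(kink)
```

On these made-up values the drop from k=3 to k=4 is 25 while the drop from k=4 to k=5 is only 1, so the heuristic picks k=4, mirroring the argument in the text.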

3 Some applications of k-means

3.1 Archeology

This example is taken from the paper "Spatial and Statistical Inference of Late Bronze Age Polities in the Southern Levant" (Savage and Falconer). The objective is to cluster the locations of archeological sites in Israel and to make inferences about political history based on the clusters. The number of clusters was chosen carefully with a complicated computational technique. The twenty-four clusters can be seen in Figure 7. From these, some speculations can be made, which can then be tested by actually going to the sites. In this sense, the clustering algorithm generates hypotheses, which must then be tested.

Figure 6: Log objective plotted against the number of clusters k

Figure 7: Clustering location of archeological sites in Israel

3.2 Computational Biology

This example is taken from the paper "Coping with cold: An integrative, multitissue analysis of the transcriptome of a poikilothermic vertebrate" (Gracey et al., 2004). In this paper, carp were exposed to different levels of cold, and genes were clustered based on their response in different tissues. The paper assumes 23 clusters without mentioning how this number was chosen. The clustering is shown in Figure 8. Green indicates that a gene is over-expressed, while red indicates that it is under-expressed. As we can see from the figure, there are some patterns in different tissues. We also see that clustering is a useful summarization tool, since we are able to represent so much information in one plot. The clustering suggests hypotheses, which we can test later.

Figure 8: Clustering of genes by their response to cold in different tissues

3.3 Education

This example is taken from the paper "Teachers as Sources of Middle School Students' Motivational Identity: Variable-Centered and Person-Centered Analytic Approaches" (Murdock and Miller, 2003). Survey results of 206 students are clustered, and the clusters are used to identify groups that buttress an analysis of what affects motivation. The number of clusters was selected to yield nice hypotheses, which can then be verified.

3.4 Sociology

This example is taken from the paper "Implications of Racial and Gender Differences in Patterns of Adolescent Risk Behavior for HIV and other Sexually Transmitted Diseases" (Halpert et al., 2004). Survey results of 13,998 students were clustered to understand patterns of drug abuse and sexual activity. The number of clusters was chosen for interpretability and stability, meaning that the authors could interpret multiple k-means runs on different data in the same way. The paper concludes that patterns exist, which is unsurprising since the number of clusters was chosen to give nice results. Also, k-means will find patterns everywhere!

4 Hierarchical clustering

Hierarchical clustering is a widely used data analysis tool. The main idea is to build a binary tree of the data that successively merges similar groups of points. Visualizing this tree provides a useful summary of the data. Recall that k-means or k-medoids requires the number of clusters k, an initial assignment of data to clusters, and a distance measure between data points $d(x_n, x_m)$. Hierarchical clustering only requires a measure of similarity between groups of data points. In this section, we will mainly talk about agglomerative clustering.

4.1 Agglomerative clustering

The agglomerative clustering algorithm can be given as:

1. Place each data point into its own singleton group.
2. Repeat: iteratively merge the two closest groups.
3. Until: all the data are merged into a single cluster.

An example in which the similarity measure is the average distance between points in the two groups can be seen in the slides.

Let us discuss some facts about the agglomerative clustering algorithm. Each level of the resulting tree is a segmentation of the data, so the algorithm results in a sequence of groupings. It is up to the user to choose a natural clustering from this sequence. Agglomerative clustering is monotonic in the sense that the similarity between merged clusters decreases monotonically with the level of the merge.

We can also construct a dendrogram, which is a useful summarization tool and part of why hierarchical clustering is popular. To plot a dendrogram, we plot each merge at the (negative) similarity between the two merged groups. This provides an interpretable visualization of the algorithm and data. Tibshirani et al. in 2001 said that groups that merge at high values relative to the merger values of their subgroups are candidates for natural clusters. We can see the dendrogram of example data in Figure 9.
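The three steps of the algorithm can be sketched directly in Python. This is a deliberately naive illustration that recomputes every pairwise group distance at each step; it assumes group-average distance as the similarity measure, and the function names and the one-dimensional toy data are made up for the example.

```python
from itertools import combinations
from math import dist


def group_average(g, h):
    """Average pairwise Euclidean distance between two groups of points."""
    return sum(dist(a, b) for a in g for b in h) / (len(g) * len(h))


def agglomerate(points, linkage=group_average):
    """Merge the two closest groups until one group remains.

    Returns the list of merges as (group_a, group_b, height) tuples,
    in the order in which they happened.
    """
    groups = [[p] for p in points]  # step 1: singleton groups
    merges = []
    while len(groups) > 1:          # steps 2 and 3: merge until one cluster
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: linkage(groups[ij[0]], groups[ij[1]]),
        )
        height = linkage(groups[i], groups[j])
        merges.append((groups[i], groups[j], height))
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
    return merges


# Toy one-dimensional data (made up): two tight pairs and one far point.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
merges = agglomerate(points)
heights = [height for _, _, height in merges]
for a, b, height in merges:
    print(a, "+", b, "at height", height)
```

The merge heights are exactly the values at which a dendrogram would draw each join. On this toy data they increase from one merge to the next, which is the monotonicity property described above, and the last merge (the far point joining everything else) happens at a height well above its subgroups.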

4.2 Group Similarity

Given a distance measure between points, the user has many choices for how to define intergroup similarity. The three most popular choices are:

Single linkage: the similarity of the closest pair

$$d_{SL}(G, H) = \min_{i \in G,\; j \in H} d_{i,j}$$

Complete linkage: the similarity of the furthest pair

$$d_{CL}(G, H) = \max_{i \in G,\; j \in H} d_{i,j}$$

Group average: the average similarity between groups

$$d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d_{i,j}$$

Figure 9: Dendrogram of example data (panel title: Cluster Dendrogram; vertical axis: Height; dist(x), hclust (*, "complete"))

4.2.1 Properties of intergroup similarity

Single linkage can produce chaining, where a sequence of close observations in different groups causes early merges of those groups. For example, in Figure 10, suppose that an earlier step has formed the two circled groups. The next step will merge these two groups, and the individual point will be left alone.


Complete linkage has the opposite problem. It might not merge close groups because of outlier members that are far apart. For example, in Figure 11, groups 1 and 3 should have been clustered together, but with complete linkage, groups 1 and 2 are merged instead.

Figure 11: Problem with complete linkage

Group average represents a natural compromise, but depends on the scale of the similarities. Applying a monotone transformation to the similarities can change the results.

4.2.2 Caveats of intergroup similarity

Hierarchical clustering should be treated with caution. Different decisions about group similarities can lead to vastly different dendrograms. The algorithm imposes a hierarchical structure on the data, even data for which such structure is not appropriate.
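A tiny numerical example can show how much the choice of group similarity matters. The sketch below implements the three definitions from Section 4.2 and asks which of two candidate groups is closest to a group G, once with absolute distance and once with squared distance (a monotone transformation). All names and the one-dimensional data are hypothetical, chosen only to make the disagreement visible.

```python
def single_linkage(g, h, d):
    """Distance of the closest pair across the two groups."""
    return min(d(i, j) for i in g for j in h)


def complete_linkage(g, h, d):
    """Distance of the furthest pair across the two groups."""
    return max(d(i, j) for i in g for j in h)


def group_average(g, h, d):
    """Average distance over all pairs across the two groups."""
    return sum(d(i, j) for i in g for j in h) / (len(g) * len(h))


def closest_group(g, candidates, linkage, d):
    """Name of the candidate group that the linkage places nearest to g."""
    return min(candidates, key=lambda name: linkage(g, candidates[name], d))


def absolute(i, j):
    return abs(i - j)


def squared(i, j):
    return (i - j) ** 2  # monotone transformation of the absolute distance


# Made-up data: H1 has one very close and one far member, H2 is compact.
G = [0.0]
candidates = {"H1": [1.0, 9.0], "H2": [5.0, 6.0]}
linkages = {
    "single": single_linkage,
    "complete": complete_linkage,
    "average": group_average,
}
choice = {
    (name, d.__name__): closest_group(G, candidates, linkage, d)
    for name, linkage in linkages.items()
    for d in (absolute, squared)
}
for key, value in choice.items():
    print(key, "->", value)
```

On this data, single linkage prefers H1 (because of its one close member) and complete linkage prefers H2 (because H1 has a far member), regardless of the transformation. Group average prefers H1 with absolute distances (5.0 versus 5.5) but switches to H2 with squared distances (41.0 versus 30.5), which illustrates its dependence on the scale of the similarities.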

4.3 Examples of Hierarchical clustering

4.3.1 Gene Expression Data Sets

This example is taken from the paper "Repeated Observation of Breast Tumor Subtypes in Independent Gene Expression Data Sets" (Sorlie et al., 2003). In this paper, hierarchical clustering of gene expression data led to new theories, which can be tested in the lab later. In general, clustering is best used cautiously, as a way of arriving at new hypotheses that can be tested later.

4.3.2 Roger de Piles

This example is taken from the paper "The Balance of Roger de Piles" (Studdert-Kennedy and Davenport, 1974). Roger de Piles rated 57 paintings along different dimensions. The authors of the paper clustered them using different methods, including hierarchical clustering. Being art critics, they also discussed the different clusters. They perform the analysis cautiously, and mention that "The value of this analysis will depend on any interesting speculation it may provoke."

4.3.3 Australian Universities

This example is taken from the paper "Similarity Grouping of Australian Universities" (Stanley and Reynolds, 1994). In this paper, hierarchical clustering is applied to Australian universities, with features such as the number of staff in different departments, entry scores, funding and evaluations.

Figure 12: Dendrogram 1 for Australian Universities

The two dendrograms can be seen in Figures 12 and 13 respectively. These two dendrograms are different. Also, on examining the agglomeration coefficient (Figure 14), the authors noticed that there is no kink and concluded that there is no cluster structure in Australian universities. The good thing about the paper is that it interprets the clustering cautiously, and the analysis is based on multiple subsets of the features. However, the conclusion that "we can't cluster Australian universities" is not well supported, since it ignores all the algorithmic choices that were made. Another problem is that the paper uses Euclidean distance without any normalization, which means that some dimensions will dominate others.

Figure 14: Australian university agglomeration coefficient

4.3.4 International Equity Markets

This example refers to the paper "Comovement of International Equity Markets: A Taxonomic Approach" (Panton et al., 1976). The data are the weekly rates of return for stocks in twelve countries. The authors ran agglomerative clustering year by year, interpreted the structure, and examined its stability over different time periods. The dendrograms over time can be seen in Figure 15.