
Outline: What is K-medoids? · K-means vs. K-medoids · Algorithm · Example · Weaknesses · Complexity · PAM · CLARA · Conclusions

K-medoids is a partitional clustering algorithm. It breaks the dataset into K clusters based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point.

Medoid - the point in the dataset whose average dissimilarity to all the objects in its cluster is minimal, i.e. the most centrally located point in the cluster.

Kaufman and Rousseeuw, 1987
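As an illustration, the medoid of a small cluster can be computed directly from the pairwise distance matrix. This is a sketch; the function name `medoid` and the choice of Euclidean distance are our own, and any dissimilarity measure could be substituted:

```python
import numpy as np

def medoid(points):
    # Pairwise Euclidean distances between all points in the cluster.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # The medoid minimizes the average dissimilarity to all the objects.
    return points[dists.mean(axis=1).argmin()]

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
center = medoid(cluster)  # the outlier at (10, 10) is never chosen
```

Unlike a centroid, the result is always an actual point of the dataset, which is what makes the method applicable when a mean cannot be defined.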


(figure: centroid (mean) vs. medoid)

Improvement brought by K-medoids:
The K-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. K-medoids is more robust to outliers than K-means and therefore results in higher-quality clustering.

When to use K-medoids over K-means? In scenarios where an imaginary point such as a mean (centroid) cannot be defined, e.g.:
- 3-D trajectories
- the gene expression context

1. Randomly select K points as the initial medoids.
2. Assign every point to its closest medoid.
3. Update step: see whether any other point would be a better medoid (i.e., has a lower average distance to all the other points in the cluster). For each medoid m and each data point o assigned to m, swap m and o and compute the total cost of the configuration (that is, the average dissimilarity of o to all the data points assigned to m); select as the new medoid the point o with the lowest configuration cost.
4. Repeat steps 2 and 3 until the medoids don't change.
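The four steps above can be sketched as a naive swap-based implementation. This is illustrative only; the function name `k_medoids`, the Euclidean distance, and the greedy accept-first-improvement rule are our own assumptions:

```python
import numpy as np

def k_medoids(X, k, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    # Precompute all pairwise Euclidean distances.
    dist = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))

    def cost(meds):
        # Sum of each point's distance to its nearest medoid.
        return dist[:, meds].min(axis=1).sum()

    # Step 1: randomly select K points as the initial medoids.
    medoids = list(rng.choice(len(X), size=k, replace=False))
    for _ in range(max_iter):
        improved = False
        for j in range(k):                      # every medoid...
            for o in range(len(X)):             # ...against every non-medoid
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o                # step 3: trial swap
                if cost(candidate) < cost(medoids):
                    medoids, improved = candidate, True
        if not improved:                        # step 4: medoids unchanged
            break
    # Step 2: assign every point to its closest medoid.
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[1.0], [2.0], [3.0], [20.0], [21.0], [22.0]])
meds, labels = k_medoids(X, 2, rng=0)
```

The two nested loops over medoids and non-medoids, with a full cost evaluation inside, are exactly what makes the naive algorithm quadratic, as discussed below.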


K must be known in advance.
Since the initial medoids are chosen randomly, results may vary from run to run.
The algorithm does not handle non-convex data sets well.
Finding a better medoid involves comparing all pairs of medoid and non-medoid points, which is relatively inefficient.

O( K * (N-K) * (N-K) ) => O(N²) per iteration for fixed K

The first loop iterates through the K medoids.
The second loop iterates through the N-K objects in the non-medoid list.
The third loop iterates through each non-medoid object in order to compute the cost of swapping the medoid and the non-medoid.


PAM (Partitioning Around Medoids, 1987) differs from standard K-medoids in the update step:
- Randomly select a non-medoid object o_random.
- Compute the total cost of the configuration obtained by swapping o_medoid with o_random.
- If the new cost is lower than the old cost, swap o_medoid with o_random.
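A minimal sketch of this randomized update step, assuming a precomputed pairwise distance matrix. The helper name `pam_update` and the tie-handling details are our own; the toy data is hypothetical:

```python
import numpy as np

def pam_update(dist, medoids, rng):
    # Total cost = sum of each point's distance to its nearest medoid.
    def total_cost(meds):
        return dist[:, meds].min(axis=1).sum()

    # Randomly select a non-medoid object o_random.
    non_medoids = [i for i in range(len(dist)) if i not in medoids]
    o_random = rng.choice(non_medoids)
    old_cost = total_cost(medoids)
    # Try swapping each current medoid with o_random; keep an improving swap.
    for j in range(len(medoids)):
        candidate = medoids.copy()
        candidate[j] = o_random
        if total_cost(candidate) < old_cost:
            return candidate
    return medoids

# hypothetical toy data: two well-separated 1-D clusters
X = np.array([[0.0], [1.0], [10.0], [11.0]])
dist = np.abs(X - X.T)
meds = pam_update(dist, np.array([0, 1]), np.random.default_rng(0))
```

Because each update evaluates the cost over all N-K non-medoid points for each of the K medoids, this yields the O( K * (N-K)² ) complexity stated below.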


Complexity: O( K * (N-K)² )
Weakness: PAM works effectively for small data sets, but does not scale well to large data sets.


CLARA (Clustering Large Applications, 1990) draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output.
Strength: deals with larger data sets than PAM.
Weaknesses:
- Efficiency depends on the sample size.
- A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the samples are biased.
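The sampling idea can be sketched as follows. This is illustrative only: `best_medoids` is an exhaustive stand-in for PAM that is feasible only on small samples, and all names and parameters are our own:

```python
import itertools
import numpy as np

def best_medoids(S, k):
    # Exhaustive stand-in for a PAM-like search, feasible on a small sample.
    d = np.sqrt(((S[:, None] - S[None, :]) ** 2).sum(-1))
    return min(itertools.combinations(range(len(S)), k),
               key=lambda m: d[:, list(m)].min(axis=1).sum())

def clara(X, k, n_samples=5, sample_size=10, rng=None):
    rng = np.random.default_rng(rng)
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        # Draw a random sample and cluster it with the PAM-like routine.
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        meds = X[idx][list(best_medoids(X[idx], k))]
        # Judge the sample's medoids on the FULL data set, keep the best.
        cost = np.sqrt(((X[:, None] - meds[None, :]) ** 2).sum(-1)).min(1).sum()
        if cost < best_cost:
            best, best_cost = meds, cost
    return best

X = np.array([[0.0], [1.0], [2.0], [50.0], [51.0], [52.0]])
centers = clara(X, 2, n_samples=3, sample_size=6, rng=0)
```

Evaluating each sample's medoids against the full data set is what lets a good sample win even when other samples are biased.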


Each cluster center is represented by a centrally located data point (the medoid) rather than by a synthetic prototype point such as the mean.
Pros
- more robust to noise and outliers than K-means
- independent of data order, unlike standard K-means clustering
- provides better class separation than K-means

Cons
- can be computationally costlier than K-means: O(N²) vs. O(N) per iteration