BITS Pilani, Hyderabad Campus
Dr. Aruna Malapati
Asst Professor, Department of CSIS

K-Means Clustering
Today’s Learning Objectives

• List the clustering algorithms

• Define K-Means clustering algorithm

• List and resolve issues with K-Means clustering

BITS Pilani, Hyderabad Campus


Clustering Algorithms

• K-means and its variants

• Hierarchical clustering

• Density-based clustering

K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
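The basic loop can be sketched in a few lines; the following is an illustrative NumPy version (the function name, iteration cap, and seed are choices made here, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points;
        # an empty cluster keeps its old centroid.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the assignments can no longer change
        centroids = new_centroids
    return labels, centroids
```

Stopping when the centroids stop moving is equivalent to the cluster assignments no longer changing.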

Importance of Choosing
Initial Centroids

[Figure: six scatter plots (Iterations 1–6) in the x–y plane showing how the centroids and cluster assignments evolve across K-means iterations.]

K-Means Clustering
(Section 9.1, Bishop, p. 454)

• Given a data set {x1, . . . , xN}, where each xn is a D-dimensional Euclidean variable.
• Our goal is to partition the data set into some number K of clusters.
• Let μk, where k = 1, . . . , K, be a prototype associated with the kth cluster (representing the centre of that cluster).
• Our goal is then to find an assignment of data points to clusters, as well as a set of vectors {μk}, such that the sum of the squared distances from each data point to its closest vector μk is a minimum.

K Means clustering

• For each data point xn, we introduce a corresponding set of binary indicator variables rnk ∈ {0, 1}, where k = 1, . . . , K, describing which of the K clusters the data point xn is assigned to: if xn is assigned to cluster k, then rnk = 1 and rnj = 0 for j ≠ k (a 1-of-K coding).

K-means Clustering

• We can then define an objective function, which represents the sum of the squares of the distances of each data point to its assigned vector μk:

J = Σn=1..N Σk=1..K rnk ‖xn − μk‖²

• Our goal is to find values for the {rnk} and the {μk} so as to minimize J.
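As a concrete reading of J, the sketch below (the function name is mine, not from the slides) evaluates the objective when the one-of-K indicators rnk are encoded as a single integer label per point:

```python
import numpy as np

def kmeans_objective(X, labels, mu):
    # rnk is 1 only for the assigned cluster, so J reduces to the squared
    # distance from each point xn to its own prototype mu[labels[n]].
    diffs = X - mu[labels]
    return float(np.sum(diffs ** 2))
```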

Importance of Choosing
Initial Centroids

[Figure: K-means Iterations 1–5 in the x–y plane for a different random initialization.]

Solution to Random
Initialization

• Perform multiple runs from different initial centroids and select the set of clusters with the minimum SSE.
• The success of this approach depends on the data set and the number of clusters chosen.
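A minimal sketch of this multiple-run strategy, assuming a small helper K-means pass (`_one_run`, written here purely for illustration):

```python
import numpy as np

def _one_run(X, K, rng, n_iter=50):
    # One K-means pass from a random initialization.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                              else centroids[k] for k in range(K)])
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

def kmeans_restarts(X, K, n_runs=10, seed=0):
    # Run K-means several times and keep the clustering with the lowest SSE.
    rng = np.random.default_rng(seed)
    return min((_one_run(X, K, rng) for _ in range(n_runs)),
               key=lambda run: run[2])
```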

Handling Empty Clusters

• The basic K-means algorithm can yield empty clusters

• Several strategies for choosing a replacement centroid:

• Choose the point that contributes most to the SSE (the point farthest from any current centroid)

• Choose a point from the cluster with the highest SSE

– If there are several empty clusters, the above can be repeated several times.

Updating Centers
Incrementally
• In the basic K-means algorithm, centroids are updated after
all points are assigned to a centroid

• An alternative is to update the centroids after each assignment (incremental approach)
– Each assignment updates zero or two centroids
– More expensive
– Never get an empty cluster
– Can use “weights” to change the impact
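The zero-or-two-centroid bookkeeping can be sketched with running counts; `move_point` below is a hypothetical helper, not from the slides, and it mutates `centroids` and `counts` in place:

```python
import numpy as np

def move_point(x, src, dst, centroids, counts):
    # Remove x's contribution from the source centroid (running-mean update)...
    counts[src] -= 1
    if counts[src] > 0:
        centroids[src] = (centroids[src] * (counts[src] + 1) - x) / counts[src]
    # ...and add it to the destination centroid.
    counts[dst] += 1
    centroids[dst] = (centroids[dst] * (counts[dst] - 1) + x) / counts[dst]
```

Each removal and addition is constant-time in the number of points, so a reassignment touches only the two affected centroids.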

Pre-processing and Post-
processing
• Pre-processing
– Normalize the data
– Eliminate outliers

• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low
SSE
– Can use these steps during the clustering process
• ISODATA

Bisecting K-means

• Bisecting K-means algorithm


– Variant of K-means that can produce a partitional or a hierarchical clustering
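One common formulation, following the usual "split the cluster with the highest SSE" rule (the helper names here are illustrative, not from the slides):

```python
import numpy as np

def _two_means(X, rng, n_iter=50):
    # Minimal 2-means used as the bisection step.
    c = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[labels == k].mean(axis=0) if (labels == k).any() else c[k]
                      for k in range(2)])
    return labels

def bisecting_kmeans(X, K, seed=0):
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(X))]  # start with one cluster holding all points
    while len(clusters) < K:
        # Pick the cluster with the highest SSE and bisect it.
        sses = [float(((X[idx] - X[idx].mean(axis=0)) ** 2).sum())
                for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        split = _two_means(X[idx], rng)
        clusters += [idx[split == 0], idx[split == 1]]
    labels = np.empty(len(X), dtype=int)
    for k, idx in enumerate(clusters):
        labels[idx] = k
    return labels
```

Recording the order of the splits yields a hierarchical clustering; keeping only the final K clusters yields a partitional one.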

Bisecting K-means Example

Limitations of K-means

• K-means has problems when clusters are of differing

– Sizes

– Densities

– Non-globular shapes

• K-means has problems when the data contains outliers.

Limitations of K-means:
Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means:
Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means:
Non-globular Shapes

Original Points K-means (2 Clusters)

Problems with K-Means
Clustering
• K-Means clustering works only for clusters that are roughly Gaussian; hence, we cannot use it to find complex or non-convex clusters.

• The K-Means algorithm is very sensitive to initialization, so one must be careful while initializing the cluster means.

• The algorithm can get stuck in a local optimum, finding clusters different from those originally wanted. This, too, is affected by the initialization of the cluster means.

K-medoids Clustering
Algorithm

PAM (Partitioning Around
Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987) uses real objects (medoids) to represent the clusters:
1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih.
3. For each pair of i and h, if TCih < 0, replace i with h.
4. Assign each non-selected object to the most similar representative object.
5. Repeat steps 2–4 until there is no change.
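The swap test can be sketched directly from a precomputed pairwise distance matrix; this is a naive scan over all (i, h) swaps per pass, with function names of my choosing:

```python
import numpy as np

def total_cost(D, medoids):
    # Cost of a medoid set: sum of each point's distance to its nearest medoid.
    return float(D[:, medoids].min(axis=1).sum())

def pam(D, k, seed=0):
    # D is a symmetric pairwise distance matrix over the n objects.
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for mi, i in enumerate(list(medoids)):
            for h in range(len(D)):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = h
                # Accept the swap i -> h when it lowers the total cost (TCih < 0).
                if total_cost(D, trial) < total_cost(D, medoids):
                    medoids = trial
                    improved = True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```

Each cost evaluation is O(nk) here, which is why PAM scales poorly to large data sets (motivating CLARA).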

A Typical K-Medoids
Algorithm (PAM)

Computation Complexity
for K-Means

• In each iteration:
• It costs O(Kn) to compute the distances between each of the n examples and the K cluster means
• It costs O(n) to update the cluster means by assigning each example to one cluster
• Assuming t iterations are performed before the algorithm terminates, the overall computational complexity is O(tKn)

K-Means/Median/Mode/Medoid
Clustering complexity

Take home message

• The K-means algorithm is a simple yet popular method for clustering analysis.
• Its performance is determined by the initialization and the choice of an appropriate distance measure.
• There are several variants of K-means that overcome its weaknesses:
• K-Medoids: resistance to noise and/or outliers
• K-Modes: extension to categorical data clustering
• CLARA: extension to deal with large data sets
• Mixture models (EM algorithm): handling uncertainty of cluster assignments