
6. K-Means Clustering

Introduction

Clustering is one of the most common exploratory data analysis techniques, used to get an
intuition about the structure of the data. It can be defined as the task of identifying subgroups
in the data such that data points in the same subgroup (cluster) are very similar while data
points in different clusters are very different. In other words, we try to find homogeneous
subgroups within the data such that the data points in each cluster are as similar as possible
according to a similarity measure such as Euclidean distance or correlation-based distance.
The choice of similarity measure is application-specific, as the sketch below illustrates.
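For instance, here is a minimal sketch contrasting the two measures (the two sample vectors are made up for illustration): two points can be far apart in Euclidean terms yet have zero correlation-based distance because their features move together.

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])

# Euclidean distance: straight-line distance in feature space.
euclidean = np.linalg.norm(a - b)

# Correlation-based distance: 1 - Pearson correlation; points whose
# features rise and fall together are "close" even if magnitudes differ.
correlation = 1.0 - np.corrcoef(a, b)[0, 1]

print(euclidean)    # 5.477... -- far apart in Euclidean terms
print(correlation)  # 0.0 -- b is a scaled copy of a, so distance is zero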

Clustering analysis can be done on the basis of features, where we try to find subgroups of
samples based on their features, or on the basis of samples, where we try to find subgroups of
features based on the samples. Here we cover clustering based on features. Clustering is used in
market segmentation, where we try to find customers that are similar to each other in terms of
behaviors or attributes; in image segmentation/compression, where we try to group similar
regions together; in document clustering based on topics; and so on.

Unlike supervised learning, clustering is considered an unsupervised learning method since we
don't have ground-truth labels against which to compare the output of the clustering algorithm
in order to evaluate its performance. We only want to investigate the structure of the data by
grouping the data points into distinct subgroups.

Types of Clustering

Broadly speaking, clustering can be divided into two subgroups:

• Hard/Exclusive Clustering: In hard clustering, each data point either belongs to a cluster
completely or not at all. For example, in customer segmentation each customer is put into
exactly one of the groups. k-means clustering is a type of exclusive clustering.
• Soft/Overlapping Clustering: In soft clustering, instead of putting each data point into a
single cluster, a probability or likelihood of that data point belonging to each cluster is
assigned. Here, an item can belong to multiple clusters with a different degree of association
with each. The fuzzy c-means algorithm is based on overlapping clustering.

• Hierarchical Clustering: In hierarchical clustering, the clusters are not formed in a single step;
rather, the algorithm follows a series of partitions to arrive at the final clusters, producing a
tree-like structure (a dendrogram).
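To make the hard/soft distinction concrete, here is a small sketch using scikit-learn. The toy data is generated only for illustration, and since fuzzy c-means is not in scikit-learn, a Gaussian mixture model stands in as the soft-clustering example:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Hard/exclusive: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(hard_labels[:5])  # one integer label per point

# Soft/overlapping: each point gets a probability for every cluster
# (Gaussian mixture used here as a stand-in for fuzzy c-means).
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict_proba(X[:2]))  # each row sums to 1 across the 3 clusters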

Types of clustering algorithms

Since the task of clustering is subjective, there are many possible means of achieving this goal.
Every methodology follows a different set of rules for defining 'similarity' among data points. In
fact, more than 100 clustering algorithms are known, but only a few are used widely. Let's look
at them in detail:
• Connectivity models: As the name suggests, these models are based on the notion that data
points closer together in data space are more similar to each other than data points lying
farther away. These models can follow two approaches. In the first, they start by placing all
data points in separate clusters and then aggregate them as the distance decreases. In the
second, all data points start in a single cluster, which is then partitioned as the distance
increases. The choice of distance function is subjective. These models are very easy to
interpret but lack scalability for handling big datasets. Examples are the hierarchical
clustering algorithm and its variants.
• Centroid models: These are iterative clustering algorithms in which the notion of similarity
is derived from the closeness of a data point to the centroid of a cluster. The k-means
clustering algorithm is a popular algorithm in this category. In these models, the number of
clusters required at the end has to be specified beforehand, which makes it important to
have prior knowledge of the dataset. These models run iteratively to find a local optimum.
• Distribution models: These clustering models are based on the notion of how probable it is
that all data points in a cluster belong to the same distribution (for example, a Gaussian).
These models often suffer from overfitting. A popular example is the
expectation-maximization algorithm, which uses multivariate normal distributions.
• Density models: These models search the data space for regions of varying density of data
points. They isolate the different density regions and assign the data points within the same
region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
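As an illustration, the following sketch runs one representative scikit-learn algorithm from each of the four families on the same toy data (generated here purely for demonstration; the parameter values are arbitrary choices):

from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Connectivity model: merges the closest points/clusters bottom-up.
conn = AgglomerativeClustering(n_clusters=4).fit_predict(X)

# Centroid model: the number of clusters must be specified up front.
cent = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Distribution model: expectation-maximization over a Gaussian mixture.
dist = GaussianMixture(n_components=4, random_state=42).fit_predict(X)

# Density model: no k needed; sparse points are labeled -1 (noise).
dens = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)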

K-Means Algorithm

The k-means algorithm is an iterative algorithm that tries to partition the dataset into K
pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to
only one group. It tries to make the intra-cluster data points as similar as possible while also
keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such
that the sum of the squared distances between the data points and the cluster's centroid (the
arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less
variation we have within clusters, the more homogeneous (similar) the data points are within
the same cluster.

The way the k-means algorithm works is as follows:

1. Specify the number of clusters K.
2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data
points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e., the assignment of data points
to clusters isn't changing:
a. Compute the squared distance between each data point and every centroid.
b. Assign each data point to the closest cluster (centroid).
c. Compute the centroid of each cluster by taking the average of all data points
that belong to it.
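Here is a from-scratch NumPy sketch of these steps (function and variable names are our own, and empty clusters are not handled):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 3a/3b: squared distance from every point to every
        # centroid, then assign each point to the closest one.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3c: recompute each centroid as the mean of its points
        # (a cluster that loses all its points is not handled here).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3 stop rule: centroids (hence assignments) unchanged.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids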

K-means is an iterative clustering algorithm that converges to a local optimum of its objective.
Its behavior can be illustrated in the following steps:

1. Specify the desired number of clusters K: Let us choose k=2 for five data points in 2-D
space.
2. Randomly assign each data point to a cluster: Say we assign three points to cluster 1
(shown in red) and two points to cluster 2 (shown in grey).

3. Compute the cluster centroids: The centroid of the red cluster's points is marked with a
red cross and that of the grey cluster with a grey cross.

4. Re-assign each point to the closest cluster centroid: Note that the data point at the
bottom was initially assigned to the red cluster even though it is closer to the grey
cluster's centroid. Thus, we re-assign that data point to the grey cluster.
5. Re-compute the cluster centroids: Now, we re-compute the centroids for both clusters.

6. Repeat steps 4 and 5 until no improvements are possible: We repeat steps 4 and 5 until
the algorithm converges to a local optimum. When there is no further switching of data
points between the two clusters for two successive iterations, the algorithm terminates,
unless another stopping criterion is specified explicitly.
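The same walkthrough can be reproduced with scikit-learn's KMeans. The five 2-D points below are hypothetical, since the original figure is not reproduced here; also note that scikit-learn initializes centroids directly (k-means++ by default) rather than randomly partitioning the points:

import numpy as np
from sklearn.cluster import KMeans

# Five hypothetical 2-D points, clustered into k=2 groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index of each of the 5 points
print(km.cluster_centers_)  # final centroids after convergence
print(km.inertia_)          # within-cluster sum of squared distances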

The approach k-means follows to solve the problem is called Expectation-Maximization. The
E-step assigns the data points to the closest cluster. The M-step computes the centroid of
each cluster. Below is a breakdown of how we can solve it mathematically.
The objective function is:

J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \lVert x^{(i)} - \mu_k \rVert^2

where w_{ik} = 1 for data point x^{(i)} if it belongs to cluster k and w_{ik} = 0 otherwise, and
\mu_k is the centroid of x^{(i)}'s cluster.

It's a minimization problem in two parts. We first minimize J w.r.t. w_{ik} while treating \mu_k
as fixed; then we minimize J w.r.t. \mu_k while treating w_{ik} as fixed. Technically speaking, we
differentiate J w.r.t. w_{ik} first and update the cluster assignments (E-step); then we
differentiate J w.r.t. \mu_k and recompute the centroids after the cluster assignments from the
previous step (M-step). Therefore, the E-step is:

w_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}

In other words, assign the data point x^{(i)} to the cluster whose centroid has the smallest
squared distance to it.
And the M-step is:

\mu_k = \frac{\sum_{i=1}^{m} w_{ik} \, x^{(i)}}{\sum_{i=1}^{m} w_{ik}}

which translates to recomputing the centroid of each cluster to reflect the new assignments.
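To see where this formula comes from: with the assignments w_{ik} held fixed, J is quadratic in \mu_k, so setting its gradient to zero yields the cluster mean directly:

\frac{\partial J}{\partial \mu_k} = -2 \sum_{i=1}^{m} w_{ik} \left( x^{(i)} - \mu_k \right) = 0
\;\Longrightarrow\;
\mu_k = \frac{\sum_{i=1}^{m} w_{ik} \, x^{(i)}}{\sum_{i=1}^{m} w_{ik}}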

A few things to note here:

• Since clustering algorithms, including k-means, use distance-based measurements to
determine the similarity between data points, it's recommended to standardize the data
to have a mean of zero and a standard deviation of one, since the features in a dataset
almost always have different units of measurement (such as age vs. income).
• Given k-means' iterative nature and the random initialization of centroids at the start of
the algorithm, different initializations may lead to different clusters, since the algorithm
may get stuck in a local optimum and may not converge to the global optimum.
Therefore, it's recommended to run the algorithm using several different initializations of
the centroids and pick the results of the run that yielded the lowest sum of squared
distances. (Both of these recommendations are sketched in code after these notes.)
• The assignment of examples not changing is the same thing as no change in the
within-cluster variation, i.e., the objective above:

\sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \lVert x^{(i)} - \mu_k \rVert^2
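Here is a sketch of both recommendations in scikit-learn (the two-feature data is hypothetical, mimicking the age-vs-income example):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: age (years) vs. income ($).
X = np.array([[25, 40_000], [32, 52_000], [47, 150_000],
              [51, 160_000], [29, 48_000], [45, 145_000]], dtype=float)

# Standardize to zero mean and unit standard deviation per feature.
X_scaled = StandardScaler().fit_transform(X)

# n_init=10 reruns k-means with 10 random centroid initializations and
# keeps the run with the lowest sum of squared distances (inertia_).
km = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X_scaled)
print(km.labels_, km.inertia_)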

