
Goals: by the end of this chapter, you will understand

• what is meant by clustering.


• common applications of clustering analysis.
• the different methods for clustering analysis.
• the implementation of clustering methods in Python.
• the advantages and limitations of each clustering method.
▪ Clustering gathers data records into natural groups (i.e., clusters) of
similar samples according to a predefined similarity/dissimilarity metric,
thereby extracting useful information about the given dataset.
▪ The contents of any cluster should be similar to each other, which is called high
intra-cluster similarity.
▪ Conversely, the contents of any cluster should be very different (i.e., dissimilar)
from the contents of other clusters, which is called high inter-cluster separation.
▪ The similarity/dissimilarity metric that is routinely utilized in clustering analysis is
a form of distance function between each pair of data records (e.g., A and B).
▪ Therefore, the distance measures how close A and B are to each other, and a
decision is made whether to combine A and B in one cluster.
▪ There are two commonly implemented simple forms of distance function, which are
the Euclidean distance and the Manhattan distance.
▪ The Euclidean distance for two-dimensional datasets (i.e., having two features) is
calculated as
d(A, B) = sqrt((xA − xB)² + (yA − yB)²)
▪ The Manhattan distance is calculated as
d(A, B) = |xA − xB| + |yA − yB|
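For illustration, both distance functions can be written in a few lines of Python (the sample points A and B below are hypothetical):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two 2-D points a and b."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between a and b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

A, B = (1, 2), (4, 6)  # hypothetical data records
print(euclidean_distance(A, B))  # 5.0
print(manhattan_distance(A, B))  # 7
```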
Question:
Compute the Euclidean distance and the Manhattan distance
1. Centroid-based clustering methods
2. Gaussian mixture model clustering methods
3. Hierarchical clustering methods
4. Density-based clustering methods
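As a preview of the Python implementations covered in this chapter, scikit-learn provides an estimator for each of the four families. The toy dataset and parameter values below are assumptions for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Hypothetical 2-D dataset with two well-separated groups.
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 8]])

# 1. Centroid-based
labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# 2. Gaussian mixtures
labels_gm = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
# 3. Hierarchical
labels_ag = AgglomerativeClustering(n_clusters=2).fit_predict(X)
# 4. Density-based
labels_db = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
```

Each estimator returns one cluster label per data record; the label values themselves are arbitrary, so only the grouping of records matters.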
▪ Centroid-based clustering searches for a pre-determined number of clusters
within an unlabeled and possibly multidimensional dataset.
▪ Each data record is assigned to one, and only one, cluster.
▪ The rule is that the distance between a data record and each of the cluster's
centroids is calculated, and this data record is assigned to the cluster achieving the
minimum distance.
▪ K-Means Clustering: a centroid-based clustering approach commonly used in
practice.
▪ The approach consists of three main steps: initialization, assignment, and
update.
▪ In the initialization step, the number of clusters is assumed (i.e., k is
predetermined), and the centroid of each cluster is randomly defined.
▪ A simple procedure for the initial placement of a cluster's centroid is to
locate it at one of the given data records.
▪ In the assignment step, the clusters are formed by connecting each data record
with its nearest centroid.
▪ In the update step, a more accurate centroid of each cluster is calculated as
the mean point of its included data records. Then, the assignment and update
steps are repeated until convergence.
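The three steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation; the function name and return values are assumptions:

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means: initialization, assignment, and update steps."""
    rng = random.Random(seed)
    # Initialization: place each centroid at a randomly chosen data record.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment: connect each record to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Update: recompute each centroid as the mean point of its records
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Convergence: stop once no centroid moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

For a toy dataset such as `[(1, 1), (1.5, 2), (8, 8), (9, 8)]` with k = 2, the sketch converges to the two group means (1.25, 1.5) and (8.5, 8.0).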
Given:

This dataset can be displayed in a Cartesian plane.

▪ In the initialization step, we will assume three clusters
for this dataset (i.e., K = 3), and select their initial
centroids to be C0 = {72, 24}, C1 = {12, 39}, and C2 = {52, 70}.
Now comes the update step, where the new centroid of each
cluster is calculated as the mean of its included data
records.
Since the new centroids differ from the old ones, a second
iteration must be performed, starting with the assignment
step.
Therefore, the distance from each data record to the new
centroid of each cluster is calculated to investigate
whether a data record needs to move to a different cluster.
These distance calculations and the new cluster
assignments are given in the table. Although the
centroids have changed, the contents of each cluster
remain the same.

This implies that there is no need for an update step
nor any further iterations, because there will be no
modification to the recently calculated centroids.

Convergence is successfully obtained, and the k-means
model is determined with its three clusters centered at
C0' = {62.33, 15.83}, C1' = {23.42, 44.57}, and C2' = {50, 63.16}.

The resulting k-means model can be applied to predict
the associated cluster(s) of some test data records.

For example, if you need to know the cluster of a data
record p with (x = 20, y = 20), you first calculate its
Euclidean distance to each cluster's centroid as follows:

Therefore, the data record is assigned to cluster “1.”
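This assignment can be checked in Python using the final centroids from the worked example above (the dictionary layout and helper name are illustrative):

```python
import math

# Final centroids obtained after convergence in the worked example.
centroids = {0: (62.33, 15.83), 1: (23.42, 44.57), 2: (50, 63.16)}

def nearest_cluster(p):
    """Assign a test record to the cluster whose centroid is closest."""
    return min(centroids, key=lambda i: math.dist(p, centroids[i]))

p = (20, 20)
for i, c in centroids.items():
    # Distances come out to approximately 42.53, 24.81, and 52.56.
    print(f"distance to C{i}' = {math.dist(p, c):.2f}")
print(nearest_cluster(p))  # 1, i.e., cluster "1"
```

The same helper can be used to check the exercise record q with (x = 60, y = 40).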


Please try the data record q with (x = 60, y = 40).
