You are on page 1of 17

Introduction to Data

Science (CCPS521)
Session 8:
K-Mean Clustering
K-Mean Clustering
Clustering is grouping data based on similarity patterns
There are methods or algorithms that can be used in case clustering : K-Means Clustering, Affinity
Propagation, Mean Shift, Spectral Clustering, Hierarchical Clustering, DBSCAN, etc.
Source: https://arifromadhan19.medium.com/machine-learning-explanation-supervised-learning-unsupervised-learning-6d4c7f2bebb2

How do we know a point has the same group as another point?


As mentioned above : based on similarity pattern

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
k-means clustering is a method that aims to partition n observations into k clusters in which
each observation belongs to the cluster with the nearest mean (cluster centers or
cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data
space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared
Euclidean distances), but not regular Euclidean distances. The mean optimizes squared errors,
whereas only the geometric median minimizes Euclidean distances. For instance, better
Euclidean solutions can be found using k-medians and k-medoids.
Source: https://en.wikipedia.org/wiki/K-means_clustering

In mathematics, a Voronoi diagram is a partition of a plane into regions close to each of a given
set of objects. In the simplest case, these objects are just finitely many points in the plane (called
seeds, sites, or generators). For each seed there is a corresponding region, called a Voronoi cell,
consisting of all points of the plane closer to that seed than to any other.
Source: https://en.wikipedia.org/wiki/Voronoi_diagram
K-Mean Clustering
How to know the pattern similarity with K-means clustering method?
Before answering that, let’s discuss what k-means is and how k-means does
clustering
THEORY — What is K-means ?
K-means clustering is a simple and elegant approach for partitioning a
data set into K distinct, non-overlapping clusters. To perform K-means
clustering, we must first specify the desired number of clusters K; then the
K-means algorithm will assign each observation to exactly one of the K
clusters

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
What is Clustering?

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
The K-means algorithm clusters data by trying to separate samples in n groups of
equal variance, minimizing a criterion known as the inertia or within-cluster sum-
of-squares. The K-means algorithm aims to choose centroid that minimise
the inertia, or within-cluster sum-of-squares criterion

How does it work?

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
How does it work?
Step 1. Determine the value “K”, the value “K” represents the number of clusters.
in this case, we’ll select K=3. That is to say, we want to identify 3 clusters.
(Is there any way to determine the value of K? yes there is, but we’ll talk later about it).
Step 2. Randomly select 3 distinct centroid (new data points as cluster initialization)
for example — attempts 1. “K” is equal 3 so there are 3 centroid, in which case it
will be the cluster initialization

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
How does it work?
Step 3. Measure the distance (euclidean distance) between each point and the
centroid. For example, measure the distance between first point and the centroid.

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
How does it work?
Step 4. Assign the each point to the nearest cluster
For example, measure the distance between first point and the centroid.

Do the same treatment for the other unlabeled point, until we get this

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
How does it work?
Step 5. Calculate the mean of each cluster as new centroid

Update the centroid with mean of each cluster


Step 6. Repeat step 3–5 with the new center of cluster

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
How does it work?
Repeat until stop:
• Convergence. (No further changes)
• Maximum number of iterations.
Since the clustering did not change at all during the last iteration, we’re done.

Has this process been completed? of course not


Remember, the K-means algorithm aims to choose centroid that minimise the inertia, or within-cluster sum-of-
squares criterion. So, how to assess the results of this clustering? let’s continue
Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
How does it work?
Step 7. Calculate the variance of each cluster

Since K-means clustering can’t “see” the best clustering, it is only option is to keep track of these clusters, and their
total variance, and do the whole thing over again with different starting points.

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
How does it work?
Step 8. Repeat step 2–7 until get the lowest sum of variance
For example — attempts 2 with different random centroid

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
How does it work?

For example — attempts 3 with different random centroid

Repeat until stop: Until we get the lowest sum of variance and pick those cluster as our result

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
How does it work?
Has this process been completed? Yes
The result of clustering is

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
K-means Clustering Overview

Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
References:
Python Data Science Handbook (In depth K-Mean Clustering): This is a very import source for understanding K-Mean
algorithm using Python https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
(Active login to my.torontomu.ca is required)
Staglianò, A., Ma, A., & Willis, G. (2017). Clustering and unsupervised learning. part 4, introduction to real-world
machine learning. O'Reilly. https://learning.oreilly.com/videos/clustering-and-unsupervised/9781492023951/
Albalate, A., & Minker, W. (2011;2013;). Semi-supervised and unsupervised machine learning: Novel strategies.
ISTE. https://doi.org/10.1002/9781118557693
https://learning.oreilly.com/library/view/semi-supervised-and-unsupervised/9781118586136/00_9781118586136_cover.html

Machine Learning with Weka - regression and clustering: https://www.youtube.com/watch?v=COpXNK0O8As&t=1200s

You might also like