CCPS521 WIN2023 Week08 K Mean Clustering

Introduction to Data
Science (CCPS521)
Session 8:
K-Mean Clustering
K-Mean Clustering
Clustering is grouping data based on similarity patterns
There are methods or algorithms that can be used in case clustering : K-Means Clustering, Affinity
Propagation, Mean Shift, Spectral Clustering, Hierarchical Clustering, DBSCAN, etc.
Source: https://arifromadhan19.medium.com/machine-learning-explanation-supervised-learning-unsupervised-learning-6d4c7f2bebb2
How do we know a point has the same group as another point?

As mentioned above : based on similarity pattern
Source: https://medium.com/data-folks-indonesia/step-by-step-to-understanding-k-means-clustering-and-implementation-with-sklearn-b55803f519d6
K-Mean Clustering
k-means clustering is a method that aims to partition n observations into k clusters in which
each observation belongs to the cluster with the nearest mean (cluster centers or
cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data
space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared
Euclidean distances), but not regular Euclidean distances. The mean optimizes squared errors,
whereas only the geometric median minimizes Euclidean distances. For instance, better
Euclidean solutions can be found using k-medians and k-medoids.
Source: https://en.wikipedia.org/wiki/K-means_clustering
In mathematics, a Voronoi diagram is a partition of a plane into regions close to each of a given
set of objects. In the simplest case, these objects are just finitely many points in the plane (called
seeds, sites, or generators). For each seed there is a corresponding region, called a Voronoi cell,
consisting of all points of the plane closer to that seed than to any other.
Source: https://en.wikipedia.org/wiki/Voronoi_diagram
K-Mean Clustering
How to know the pattern similarity with K-means clustering method?
Before answering that, let’s discuss what k-means is and how k-means does
clustering
THEORY — What is K-means ?
K-means clustering is a simple and elegant approach for partitioning a
data set into K distinct, non-overlapping clusters. To perform K-means
clustering, we must first specify the desired number of clusters K; then the
K-means algorithm will assign each observation to exactly one of the K
clusters
K-Mean Clustering
What is Clustering?
K-Mean Clustering
The K-means algorithm clusters data by trying to separate samples in n groups of
equal variance, minimizing a criterion known as the inertia or within-cluster sum-
of-squares. The K-means algorithm aims to choose centroid that minimise
the inertia, or within-cluster sum-of-squares criterion
How does it work?
K-Mean Clustering
How does it work?
Step 1. Determine the value “K”, the value “K” represents the number of clusters.
in this case, we’ll select K=3. That is to say, we want to identify 3 clusters.
(Is there any way to determine the value of K? yes there is, but we’ll talk later about it).
Step 2. Randomly select 3 distinct centroid (new data points as cluster initialization)
for example — attempts 1. “K” is equal 3 so there are 3 centroid, in which case it
will be the cluster initialization
K-Mean Clustering
How does it work?
Step 3. Measure the distance (euclidean distance) between each point and the
centroid. For example, measure the distance between first point and the centroid.
K-Mean Clustering
How does it work?
Step 4. Assign the each point to the nearest cluster
For example, measure the distance between first point and the centroid.
Do the same treatment for the other unlabeled point, until we get this
K-Mean Clustering
How does it work?
Step 5. Calculate the mean of each cluster as new centroid
Update the centroid with mean of each cluster

Step 6. Repeat step 3–5 with the new center of cluster
K-Mean Clustering
How does it work?
Repeat until stop:
• Convergence. (No further changes)
• Maximum number of iterations.
Since the clustering did not change at all during the last iteration, we’re done.
Has this process been completed? of course not

Remember, the K-means algorithm aims to choose centroid that minimise the inertia, or within-cluster sum-of-
squares criterion. So, how to assess the results of this clustering? let’s continue
K-Mean Clustering
How does it work?
Step 7. Calculate the variance of each cluster
Since K-means clustering can’t “see” the best clustering, it is only option is to keep track of these clusters, and their
total variance, and do the whole thing over again with different starting points.
K-Mean Clustering
How does it work?
Step 8. Repeat step 2–7 until get the lowest sum of variance
For example — attempts 2 with different random centroid
K-Mean Clustering
How does it work?
For example — attempts 3 with different random centroid
Repeat until stop: Until we get the lowest sum of variance and pick those cluster as our result
K-Mean Clustering
How does it work?
Has this process been completed? Yes
The result of clustering is
K-Mean Clustering
K-means Clustering Overview
K-Mean Clustering
References:
Python Data Science Handbook (In depth K-Mean Clustering): This is a very import source for understanding K-Mean
algorithm using Python https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
(Active login to my.torontomu.ca is required)
Staglianò, A., Ma, A., & Willis, G. (2017). Clustering and unsupervised learning. part 4, introduction to real-world
machine learning. O'Reilly. https://learning.oreilly.com/videos/clustering-and-unsupervised/9781492023951/
Albalate, A., & Minker, W. (2011;2013;). Semi-supervised and unsupervised machine learning: Novel strategies.
ISTE. https://doi.org/10.1002/9781118557693
https://learning.oreilly.com/library/view/semi-supervised-and-unsupervised/9781118586136/00_9781118586136_cover.html
Machine Learning with Weka - regression and clustering: https://www.youtube.com/watch?v=COpXNK0O8As&t=1200s

CCPS521 WIN2023 Week08 K Mean Clustering

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CCPS521 WIN2023 Week08 K Mean Clustering

Uploaded by

Copyright:

Available Formats

Introduction to Data

How do we know a point has the same group as another point?

How does it work?

Update the centroid with mean of each cluster

Has this process been completed? of course not

For example — attempts 3 with different random centroid

Machine Learning with Weka - regression and clustering: https://www.youtube.com/watch?v=COpXNK0O8As&t=1200s

You might also like