
Data Science and Big Data Analytics

CHAPTER FOUR
ADVANCED ANALYTICAL THEORY AND METHODS:
CLUSTERING
Outline
▪ Overview
▪ Clustering Algorithms
• Partitional Algorithms
✓ K-means
✓ Example for K-means Clustering
• Hierarchical
✓ Example for Hierarchical Clustering
• Density-based
✓ DBSCAN

12/26/2022 DATA SCIENCE AND BIG DATA ANALYTICS, CHAPTER FOUR


What is Clustering Analysis 1 of 2
▪ Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups
▪ No a priori class labels
▪ No knowledge of what is “correct”
▪ Attempt to understand how different data points are related to each
other
▪ The greater the similarity within a group and the difference between
groups, the better the clustering
▪ The goal of clustering is to reduce the amount of data by
categorizing or grouping similar data items together.
What is Clustering Analysis 2 of 2
▪ Supervised learning has access to class labels
▪ Unsupervised learning does not:
✓ No model is trained against known labels
✓ No ground truth is required
▪ Clustering is the partitioning of data into subgroups
✓ e.g., according to a distance function
▪ Two primary settings:
✓ Predictive clustering – trying to predict the cluster an instance
belongs to
✓ Descriptive clustering – clustering data to discover how
instances are connected
Clustering vs. classification 1 of 2

Clustering vs. classification 2 of 2

… More about the Clustering Analysis 1 of 2
▪ Hard clustering approaches
✓ Instances belong to only one cluster
✓ Often heuristic, i.e. provides a good-enough cluster solution
✓ Most common approach
▪ Soft clustering approaches
✓ Instances can belong to several clusters
✓ Sometimes called fuzzy clustering
▪ Clustering is the most important unsupervised learning problem
… More about the Clustering Analysis 2 of 2

Applications of cluster analysis 1 of 2
▪ Marketing: finding groups of customers with similar behaviors given a large
database of customer data containing their properties and past buying records;
▪ Biology: classification of plants and animals given their features;
▪ Libraries: book ordering;
▪ Insurance: identifying groups of motor insurance policy holders with a high
average claim cost; identifying frauds;
▪ City-planning: identifying groups of houses according to their house type,
value and geographical location;
▪ Earthquake studies: clustering observed earthquake epicenters to identify
dangerous zones;
▪ WWW: document classification; clustering weblog data to discover groups of
similar access patterns.
Applications of cluster analysis 2 of 2

Source: https://www.youtube.com/watch?v=DfJJzu6Vzi4

Descriptive clustering
▪ Explorative clustering
▪ Provides knowledge about the data
✓ Can only describe the data at hand; cannot predict the
cluster of a new instance
▪ Can be performed by first clustering and then finding
the set of features associated with each cluster.

Example algorithms
▪ Most algorithms assume there are k number of clusters
▪ Often uses a distance function to compare similarity of instances
✓ Defines a distance between two points in a set
▪ Distance function is dependent on the data and problem
✓ Euclidean distance is common (requires numerical data)
✓ Jaccard distance
✓ Manhattan distance, and others
▪ The choice of algorithm depends on the problem
✓ Each algorithm is suited for different areas, or different problems
✓ Each algorithm is suited for different types of data
▪ Important that the choice of algorithm is carefully considered
✓ Algorithm affects the clustering solution
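As a quick illustration of the distance functions above, here is a minimal Python sketch (the helper names are illustrative, not from any particular library):

```python
import math

def euclidean(p, q):
    # Straight-line distance between two numeric points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard(s, t):
    # 1 - |intersection| / |union|, for sets of items
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)

print(euclidean((0, 0), (3, 4)))        # 5.0
print(manhattan((0, 0), (3, 4)))        # 7
print(jaccard({"a", "b"}, {"b", "c"}))  # 2/3 ≈ 0.667
```

Note that Euclidean and Manhattan distance require numerical coordinates, while Jaccard distance compares sets, which is why the choice depends on the data.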
Evaluation of Clustering Results
▪ No general metric; the criterion depends on the algorithm
✓ Low variance for 𝑘-Means
✓ High density for DBSCAN
✓ …
▪ Often requires manual checks
✓ Do the clusters make sense?
✓ Can be difficult with:
✓ Very large data
✓ Many clusters
✓ High-dimensional data
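For k-Means, the low-variance criterion above is usually measured as the within-cluster sum of squared distances to each centroid (often called inertia). A minimal sketch with illustrative names:

```python
def inertia(clusters):
    # clusters: list of clusters, each a list of numeric points (tuples)
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid = per-coordinate mean of the cluster's points
        centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
        # Add squared distance of every point to its centroid
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
                     for p in points)
    return total

# Two tight, well-separated clusters give a low inertia
print(inertia([[(0, 0), (0, 2)], [(10, 0), (10, 2)]]))  # 4.0
```

Lower inertia means tighter clusters, but note it always decreases as the number of clusters grows, so it cannot by itself choose k.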

Distance Measurement - Euclidean Distance

K-means Overview
▪ An unsupervised clustering algorithm
▪ “K” stands for the number of clusters; it is typically a user input to the
algorithm, though some criteria can be used to estimate K automatically
▪ It is an approximation to an NP-hard combinatorial optimization
problem
▪ The K-means algorithm is iterative in nature
▪ It converges, but only to a local minimum
▪ Works only for numerical data
▪ Easy to implement
How Does the K-Means Clustering Algorithm Work?

K-means: The Algorithm

1. Place K points into the space represented by the objects that are being
clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K
centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces
a separation of the objects into groups from which the metric to be
minimized can be calculated.
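The four steps above can be sketched in plain Python (a minimal illustration, not a production implementation; initialization here is a random sample of the data points):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    # Step 1: place K initial centroids (here: a random sample of the data)
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 2: assign each point to the group with the closest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            best = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[best].append(p)
        # Step 3: recalculate each centroid as the mean of its group
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        # Step 4: stop when the centroids no longer move
        if new == centroids:
            break
        centroids = new
    return centroids, groups

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, groups = kmeans(pts, 2)
```

Because of the random initialization, different seeds can converge to different local minima, which is why K-means is often run several times and the lowest-variance result kept.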
K-means clustering

K-Means Clustering: Example 1
▪ Let's assume that we have 4 different medicines and we want to
cluster them into two different clusters

▪ Graphical representation →

Example 1
▪ Step 0: Randomly initialize centroids

Example 1
▪ Step 1: Find closest centroid to each data point using Euclidean
distance

Example 1
▪ Step 2: Recompute the centroid for each group. Each coordinate of the
new centroid (c1, c2, ..., cn) for a cluster C is the average of that
coordinate over the cluster's members:

Example 1
▪ Find nearest centroid using Euclidean distance

Example 1
▪ Step 2: Recompute the centroid for each group. The new centroid (c1, c2,
..., cn) for the cluster C is found through:

Example 1
▪ Find nearest centroid using Euclidean distance

▪ The iterations have converged: the assignments no longer change, and a
(locally) optimal solution is obtained.


Scenario - Initialization

Scenario – Assignment

Scenario – Update

Scenario – (Re)Assignment

Scenario – Update

Scenario – (Re)Assignment

Scenario – Update

Scenario – (Re)Assignment

Scenario – Update

Scenario – (Re)Assignment (no change)

Exercise

Limitations of the K-means Clustering Algorithm

Limitations: K-means with clusters of different sizes

Limitations: K-means with clusters of different density

Limitations: K-means with clusters of non-globular shapes

Limitations: Initializing K-means

Hierarchical Clustering
▪ Tree-based approach
▪ Produces a set of nested clusters organized as a hierarchical tree
▪ Can be visualized as a dendrogram
✓ A tree-like diagram that records the sequence of merges or
splits

Strengths of Hierarchical Clustering
▪ You do not have to assume any particular number of
clusters
✓ Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level
▪ They may correspond to meaningful taxonomies
✓ Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)

Hierarchical Clustering Algorithm
▪ Given a set of N items to be clustered, the process of
hierarchical (agglomerative) clustering is as follows:
1. Find the distances between all pairs of data points
using a distance function.
2. Find the closest pair of data points (or clusters) and
merge them into a single cluster.
3. Compute the distance between the new cluster and each
remaining cluster based on a linkage function.
4. Repeat steps 2 and 3 until all items are clustered into a
single cluster of size N.
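The agglomerative procedure above can be sketched in plain Python for the single-linkage case (names are illustrative; a real implementation would cache and shrink a distance matrix rather than recompute distances each round):

```python
import math

def single_linkage_hac(points):
    # Start with each point in its own cluster
    clusters = [[p] for p in points]
    merges = []  # the recorded merge sequence is the dendrogram
    while len(clusters) > 1:
        # Find the closest pair of clusters under single linkage
        # (smallest pairwise distance between their members)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Merge the pair and repeat until one cluster of size N remains
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = single_linkage_hac([(0, 0), (0, 1), (5, 5), (5, 6)])
```

Cutting the recorded merge sequence at a chosen distance threshold recovers any desired number of clusters, which is the 'cutting the dendrogram' idea from the previous slide.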
Hierarchical Clustering
▪ Common linkage functions:
▪ Single linkage
✓ Distance between clusters is the smallest pairwise distance between
instances
▪ Complete linkage
✓ Distance between clusters is the largest pairwise distance
▪ Average linkage
✓ Distance between clusters is the average pairwise distance
▪ Centroid linkage
✓ Distance between clusters is the distance between the cluster means
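The four linkage functions can be compared directly on a small example (a sketch with illustrative names):

```python
import math

def pairwise(c1, c2):
    # All pairwise distances between members of two clusters
    return [math.dist(a, b) for a in c1 for b in c2]

def mean(c):
    # Per-coordinate mean of a cluster (its centroid)
    return tuple(sum(x) / len(c) for x in zip(*c))

c1, c2 = [(0, 0), (0, 2)], [(4, 0), (4, 2)]
dists = pairwise(c1, c2)

single   = min(dists)              # smallest pairwise distance
complete = max(dists)              # largest pairwise distance
average  = sum(dists) / len(dists) # average pairwise distance
centroid = math.dist(mean(c1), mean(c2))  # distance between cluster means

# For this symmetric example, single and centroid are both 4.0,
# while complete and average are larger
print(single, complete, average, centroid)
```

Since the linkage function decides which clusters merge next, different choices can produce quite different dendrograms from the same data.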

Hierarchical Clustering - Example

Hierarchical Clustering - Example
▪ Finding distances between the objects

Hierarchical Clustering - Example
▪ Join the two closest points into a cluster.

Hierarchical Clustering - Example
▪ Reduce the distance matrix using the linkage method. Draw the
dendrogram.

Hierarchical Clustering - Example
▪ Join the two closest points into a cluster.

Hierarchical Clustering - Example
▪ Reduce the distance matrix using the linkage method. Draw the
dendrogram.

Hierarchical Clustering - Example

Hierarchical Clustering - Example
▪ Reduce the distance matrix using the linkage method. Draw the
dendrogram.

Hierarchical Clustering - Example
▪ Join last two clusters.

Hierarchical Clustering: Limitations
▪ Greedy: Once a decision is made to combine two
clusters, it cannot be undone
▪ No global objective function is directly minimized
▪ Sensitivity to noise and outliers
▪ Difficulty handling clusters of different sizes and
non-globular shapes
▪ Breaking large clusters

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
▪ Each cluster contains at least one core point
▪ If p is a core point, then it forms a cluster together
with all points (core or non-core) that are reachable
from it.
▪ Non-core points can be part of a cluster, but they
form its "edge", since they cannot be used to reach
more points.
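The core/border/noise behavior described above can be sketched in plain Python (a simplified illustration that assumes distinct points and recomputes neighborhoods naively; `eps` and `min_pts` correspond to the Epsilon and MinPts parameters):

```python
import math

def dbscan(points, eps, min_pts):
    # Label each point with a cluster id, or -1 for noise
    labels = {p: None for p in points}  # assumes distinct points

    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1  # noise (may later become a border point)
            continue
        # p is a core point: grow a cluster from everything reachable from it
        labels[p] = cluster
        seeds = list(nbrs)
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster  # border point: joins the cluster's "edge"
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:  # only core points extend the frontier
                seeds.extend(q_nbrs)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=2, min_pts=3)
```

On this data the two dense groups form separate clusters while the isolated point at (50, 50) is labeled noise, matching the core/border/noise picture above.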

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN Algorithm

DBSCAN - Example

DBSCAN - Example
▪ Finding distances between the points

DBSCAN - Example
▪ MinPts = 3
▪ Epsilon = 3

▪ MinPts = 2
▪ Epsilon = 3

▪ MinPts = 3
▪ Epsilon = 4

DBSCAN - Core, border and noise points

• Resistant to Noise
• Can handle clusters of
different shapes and sizes
DBSCAN - Challenges

DBSCAN Properties
▪ Discover clusters of arbitrary shape
▪ Handle noise
▪ Sensitive to differences in densities
▪ Can have trouble with high dimensional data
▪ Need to define neighborhood distance
▪ Difficult to set epsilon
Comparison of DBSCAN and K-means 1 of 2
▪ Both are partitional
▪ K-means has a prototype-based notion of a cluster; DBSCAN uses a
density-based notion
▪ K-means can find clusters that are not well separated. DBSCAN will
merge clusters that touch
▪ K-means prefers globular clusters; DBSCAN handles clusters of different
shapes and sizes
▪ K-means performs poorly in the presence of outliers; DBSCAN can
handle noise and outliers
▪ K-means can only be applied to data for which a centroid is meaningful;
DBSCAN requires a meaningful definition of density
Comparison of DBSCAN and K-means 2 of 2
▪ K-means works well for some types of high-dimensional data; DBSCAN
works poorly on high-dimensional data
▪ Both techniques were designed for Euclidean data, but extended to
other types of data
▪ K-means is really assuming spherical Gaussian distributions; DBSCAN
makes no distribution assumptions;
▪ Because of random initialization, the clusters found by K-means can vary
from one run to another; DBSCAN is deterministic for fixed parameters
(apart from border points, whose assignment can depend on processing order)
▪ DBSCAN automatically determines the number of clusters; K-means
does not
▪ K-means has only one parameter (K); DBSCAN has two (Epsilon and MinPts).
Other Clustering Methods
▪ Expectation Maximization (EM)
✓ Soft clustering via mixture models (cluster membership is treated
as missing data)
▪ Self-Organizing Maps (SOM)
✓ A popular neural-network method for cluster analysis
✓ Reduces data dimensions
▪ Fuzzy c-means clustering

▪ Minimum cut clustering

▪ Mean-Shift Clustering
✓ Used mainly in image processing and computer vision
▪ etc.