
Data Science and Big Data Analytics

CHAPTER FOUR
ADVANCED ANALYTICAL THEORY AND METHODS:
CLUSTERING
Outline
▪ Overview
▪ Clustering Algorithms
• Partitional Algorithms
✓ K-means
✓ Example for K-means Clustering
• Hierarchical
✓ Example for Hierarchical Clustering
• Density-based
✓ DBSCAN

12/26/2022 DATA SCIENCE AND BIG DATA ANALYTICS, CHAPTER FOUR


What is Clustering Analysis 1 of 2
▪ Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups
▪ No a priori class labels
▪ No knowledge of what is “correct”
▪ Attempt to understand how different data points are related to each
other
▪ The greater the similarity within a group and the difference between
groups, the better the clustering
▪ The goal of clustering is to reduce the amount of data by
categorizing or grouping similar data items together.
What is Clustering Analysis 2 of 2
▪ Supervised learning has access to class labels
▪ Unsupervised learning does not:
✓ No model is trained against known labels
✓ No ground truth is required
▪ Clustering is the partitioning of data into subgroups
✓ e.g., according to a distance function
▪ Two primary settings:
✓ Predictive clustering – trying to predict the cluster an instance
belongs to
✓ Descriptive clustering – clustering data to discover how
instances are connected
Clustering vs. classification 1 of 2

Clustering vs. classification 2 of 2

… More about the Clustering Analysis 1 of 2
▪ Hard clustering approaches
✓ Instances belong to only one cluster
✓ Often heuristic, i.e. provides a good-enough cluster solution
✓ Most common approach
▪ Soft clustering approaches
✓ Instances can belong to several clusters
✓ Sometimes called fuzzy clustering
▪ Clustering is the most important unsupervised learning problem
… More about the Clustering Analysis 2 of 2

Applications of cluster analysis 1 of 2
▪ Marketing: finding groups of customers with similar behaviors given a large
database of customer data containing their properties and past buying records;
▪ Biology: classification of plants and animals given their features;
▪ Libraries: book ordering;
▪ Insurance: identifying groups of motor insurance policy holders with a high
average claim cost; identifying frauds;
▪ City-planning: identifying groups of houses according to their house type,
value and geographical location;
▪ Earthquake studies: clustering observed earthquake epicenters to identify
dangerous zones;
▪ WWW: document classification; clustering weblog data to discover groups of
similar access patterns.
Applications of cluster analysis 2 of 2

Source: https://www.youtube.com/watch?v=DfJJzu6Vzi4

Descriptive clustering
▪ Explorative clustering
▪ Provides knowledge about the data
✓ Can only describe the data at hand; cannot predict the
cluster of a new instance
▪ Can be performed by first clustering and then finding
the set of features associated with each cluster.

Example algorithms
▪ Most algorithms assume there are k number of clusters
▪ Often uses a distance function to compare similarity of instances
✓ Defines a distance between two points in a set
▪ Distance function is dependent on the data and problem
✓ Euclidean distance is common (requires numerical data)
✓ Jaccard distance
✓ Manhattan distance, and others
▪ The choice of algorithm depends on the problem
✓ Each algorithm is suited for different areas, or different problems
✓ Each algorithm is suited for different types of data
▪ Important that the choice of algorithm is carefully considered
✓ Algorithm affects the clustering solution
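As a quick illustration of the distance functions above, here is a minimal Python sketch (the helper names are illustrative, not from any particular library):

```python
import math

def euclidean(p, q):
    # Straight-line distance between two numeric points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard(s, t):
    # 1 - |intersection| / |union|, for sets of items
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)

print(euclidean((0, 0), (3, 4)))        # 5.0
print(manhattan((0, 0), (3, 4)))        # 7
print(jaccard({"a", "b"}, {"b", "c"}))  # 2/3 ≈ 0.667
```

Note that Euclidean and Manhattan distance require numerical coordinates, while Jaccard distance compares sets, which is why the choice depends on the data.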
Evaluation of Clustering Results
▪ No general metric; the criterion depends on the algorithm
✓ Low variance for 𝑘-Means
✓ High density for DBSCAN
✓ …
▪ Often requires manual checks
✓ Do the clusters make sense?
✓ Can be difficult with:
✓ Very large data
✓ Many clusters
✓ High-dimensional data
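For k-Means, the low-variance criterion above is usually measured as the within-cluster sum of squared distances to each centroid (often called inertia). A minimal sketch with illustrative names:

```python
def inertia(clusters):
    # clusters: list of clusters, each a list of numeric points (tuples)
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid = per-coordinate mean of the cluster's points
        centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
        # Add squared distance of every point to its centroid
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
                     for p in points)
    return total

# Two tight, well-separated clusters give a low inertia
print(inertia([[(0, 0), (0, 2)], [(10, 0), (10, 2)]]))  # 4.0
```

Lower inertia means tighter clusters, but note it always decreases as the number of clusters grows, so it cannot by itself choose k.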

Distance Measurement - Euclidean Distance

K-means Overview
▪ An unsupervised clustering algorithm
▪ “K” stands for the number of clusters; it is typically a user input to the
algorithm, though some criteria can be used to estimate K automatically
▪ It is an approximation to an NP-hard combinatorial optimization
problem
▪ The K-means algorithm is iterative in nature
▪ It converges, but only to a local minimum
▪ Works only for numerical data
▪ Easy to implement
How Does the K-Means Clustering Algorithm Work?

K-means: The Algorithm

1. Place K points into the space represented by the objects that are being
clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K
centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces
a separation of the objects into groups from which the metric to be
minimized can be calculated.
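The four steps above can be sketched in plain Python (a minimal illustration, not a production implementation; initialization here is a random sample of the data points):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    # Step 1: place K initial centroids (here: a random sample of the data)
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 2: assign each point to the group with the closest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            best = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[best].append(p)
        # Step 3: recalculate each centroid as the mean of its group
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        # Step 4: stop when the centroids no longer move
        if new == centroids:
            break
        centroids = new
    return centroids, groups

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, groups = kmeans(pts, 2)
```

Because of the random initialization, different seeds can converge to different local minima, which is why K-means is often run several times and the lowest-variance result kept.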
K-means clustering

K-Means Clustering: Example 1
▪ Let's assume that we have 4 different medicines and we want to
cluster them into two different clusters

▪ Graphical representation →

Example 1
▪ Step 0: Randomly initialize centroids

Example 1
▪ Step 1: Find closest centroid to each data point using Euclidean
distance

Example 1
▪ Step 2: Recompute the centroid for each group. Each coordinate of the
new centroid (c1, c2, ..., cn) for a cluster C is the average of that
coordinate over the cluster's members:

Example 1
▪ Find nearest centroid using Euclidean distance

Example 1
▪ Step 2: Recompute the centroid for each group. The new centroid (c1, c2,
..., cn) for the cluster C is found through:

Example 1
▪ Find nearest centroid using Euclidean distance

▪ The iterations have converged: the assignments no longer change, and a
(locally) optimal solution is obtained.


Scenario - Initialization

Scenario – Assignment

Scenario – Update

Scenario – (Re)Assignment

Scenario – Update

Scenario – (Re)Assignment

Scenario – Update

Scenario – (Re)Assignment

Scenario – Update

Scenario – (Re)Assignment (no change)

Exercise

Limitations of the K-means Clustering Algorithm

Limitations: K-means with clusters of different sizes

Limitations: K-means with clusters of different density

Limitations: K-means with clusters of non-globular shapes

Limitations: Initializing K-means

Hierarchical Clustering
▪ Tree-based approach
▪ Produces a set of nested clusters organized as a hierarchical tree
▪ Can be visualized as a dendrogram
✓ A tree-like diagram that records the sequence of merges or
splits

Strengths of Hierarchical Clustering
▪ You do not have to assume any particular number of
clusters
✓ Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level
▪ They may correspond to meaningful taxonomies
✓ Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)

Hierarchical Clustering Algorithm
▪ Given a set of N items to be clustered, the process of
hierarchical (agglomerative) clustering is as follows:
1. Find the distances between all pairs of data points
using a distance function.
2. Find the closest pair of data points (or clusters) and
merge them into a single cluster.
3. Compute the distance between the new cluster and each
remaining cluster based on a linkage function.
4. Repeat steps 2 and 3 until all items are clustered into a
single cluster of size N.
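The agglomerative procedure above can be sketched in plain Python for the single-linkage case (names are illustrative; a real implementation would cache and shrink a distance matrix rather than recompute distances each round):

```python
import math

def single_linkage_hac(points):
    # Start with each point in its own cluster
    clusters = [[p] for p in points]
    merges = []  # the recorded merge sequence is the dendrogram
    while len(clusters) > 1:
        # Find the closest pair of clusters under single linkage
        # (smallest pairwise distance between their members)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Merge the pair and repeat until one cluster of size N remains
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = single_linkage_hac([(0, 0), (0, 1), (5, 5), (5, 6)])
```

Cutting the recorded merge sequence at a chosen distance threshold recovers any desired number of clusters, which is the 'cutting the dendrogram' idea from the previous slide.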
Hierarchical Clustering
▪ Common linkage functions:
▪ Single linkage
✓ Distance between clusters is the smallest pairwise distance between
instances
▪ Complete linkage
✓ Distance between clusters is the largest pairwise distance
▪ Average linkage
✓ Distance between clusters is the average pairwise distance
▪ Centroid linkage
✓ Distance between clusters is the distance between the cluster means
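The four linkage functions can be compared directly on a small example (a sketch with illustrative names):

```python
import math

def pairwise(c1, c2):
    # All pairwise distances between members of two clusters
    return [math.dist(a, b) for a in c1 for b in c2]

def mean(c):
    # Per-coordinate mean of a cluster (its centroid)
    return tuple(sum(x) / len(c) for x in zip(*c))

c1, c2 = [(0, 0), (0, 2)], [(4, 0), (4, 2)]
dists = pairwise(c1, c2)

single   = min(dists)              # smallest pairwise distance
complete = max(dists)              # largest pairwise distance
average  = sum(dists) / len(dists) # average pairwise distance
centroid = math.dist(mean(c1), mean(c2))  # distance between cluster means

# For this symmetric example, single and centroid are both 4.0,
# while complete and average are larger
print(single, complete, average, centroid)
```

Since the linkage function decides which clusters merge next, different choices can produce quite different dendrograms from the same data.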

Hierarchical Clustering - Example

Hierarchical Clustering - Example
▪ Finding distances between the objects

Hierarchical Clustering - Example
▪ Join the two closest points into a cluster.

Hierarchical Clustering - Example
▪ Reduce the distance matrix using the linkage method. Draw the
dendrogram.

Hierarchical Clustering - Example
▪ Join the two closest points into a cluster.

Hierarchical Clustering - Example
▪ Reduce the distance matrix using the linkage method. Draw the
dendrogram.

Hierarchical Clustering - Example

Hierarchical Clustering - Example
▪ Reduce the distance matrix using the linkage method. Draw the
dendrogram.

Hierarchical Clustering - Example
▪ Join last two clusters.

Hierarchical Clustering: Limitations
▪ Greedy: Once a decision is made to combine two
clusters, it cannot be undone
▪ No global objective function is directly minimized
▪ Sensitivity to noise and outliers
▪ Difficulty handling clusters of different sizes and
non-globular shapes
▪ Breaking large clusters

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
▪ Each cluster contains at least one core point
▪ If p is a core point, then it forms a cluster together
with all points (core or non-core) that are reachable
from it.
▪ Non-core points can be part of a cluster, but they
form its "edge", since they cannot be used to reach
more points.
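The core/border/noise behavior described above can be sketched in plain Python (a simplified illustration that assumes distinct points and recomputes neighborhoods naively; `eps` and `min_pts` correspond to the Epsilon and MinPts parameters):

```python
import math

def dbscan(points, eps, min_pts):
    # Label each point with a cluster id, or -1 for noise
    labels = {p: None for p in points}  # assumes distinct points

    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1  # noise (may later become a border point)
            continue
        # p is a core point: grow a cluster from everything reachable from it
        labels[p] = cluster
        seeds = list(nbrs)
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster  # border point: joins the cluster's "edge"
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:  # only core points extend the frontier
                seeds.extend(q_nbrs)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=2, min_pts=3)
```

On this data the two dense groups form separate clusters while the isolated point at (50, 50) is labeled noise, matching the core/border/noise picture above.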

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN Algorithm

DBSCAN - Example

DBSCAN - Example
▪ Finding distances between the points

DBSCAN - Example
▪ MinPts = 3
▪ Epsilon = 3

▪ MinPts = 2
▪ Epsilon = 3

▪ MinPts = 3
▪ Epsilon = 4

DBSCAN - Core, border and noise points

• Resistant to Noise
• Can handle clusters of
different shapes and sizes
DBSCAN - Challenges

DBSCAN Properties
▪ Discover clusters of arbitrary shape
▪ Handle noise
▪ Sensitive to differences in densities
▪ Can have trouble with high dimensional data
▪ Need to define neighborhood distance
▪ Difficult to set epsilon
Comparison of DBSCAN and K-means 1 of 2
▪ Both are partitional
▪ K-means has a prototype-based notion of a cluster; DBSCAN uses a
density-based notion
▪ K-means can find clusters that are not well separated. DBSCAN will
merge clusters that touch
▪ K-means prefers globular clusters; DBSCAN handles clusters of different
shapes and sizes
▪ K-means performs poorly in the presence of outliers; DBSCAN can
handle noise and outliers
▪ K-means can only be applied to data for which a centroid is meaningful;
DBSCAN requires a meaningful definition of density
Comparison of DBSCAN and K-means 2 of 2
▪ K-means works well for some types of high-dimensional data; DBSCAN
works poorly on high-dimensional data
▪ Both techniques were designed for Euclidean data, but extended to
other types of data
▪ K-means is really assuming spherical Gaussian distributions; DBSCAN
makes no distribution assumptions;
▪ Because of random initialization, the clusters found by K-means can vary
from one run to another; DBSCAN is deterministic for fixed parameters
(apart from border points, whose assignment can depend on processing order)
▪ DBSCAN automatically determines the number of clusters; K-means
does not
▪ K-means has only one parameter (K); DBSCAN has two (Epsilon and MinPts).
Other Clustering Methods
▪ Expectation Maximization (EM)
✓ Soft clustering via mixture models (cluster membership is treated
as missing data)
▪ Self-Organizing Maps (SOM)
✓ A popular neural-network method for cluster analysis
✓ Reduces data dimensions
▪ Fuzzy c-means clustering

▪ Minimum cut clustering

▪ Mean-Shift Clustering
✓ Used mainly in image processing and computer vision
▪ etc.