
Pattern Recognition: Clustering

Badri Narayan Subudhi


Indian Institute of Technology Jammu
NH44, Nagrota, Jagti, Jammu
Subudhi.badri@gmail.com
What is Pattern Recognition?

• Pattern Recognition is the study of how machines can:


– observe the environment,
– learn to distinguish patterns of interest,
– make sound and reasonable decisions about the
categories of the patterns.

• What is a pattern?
• What kinds of categories do we have?
What is a pattern?

• A pattern is an abstraction, represented by a set of measurements
describing a “physical” object.
• Many types of patterns exist:
– visual, temporal, sonic, logical, ...
What is a Pattern Class (or category)?

– A set of patterns sharing common attributes

– A collection of “similar”, not necessarily identical, objects


What are Features?

• Features are properties of an object:

– Ideally representative of a specific type (i.e., class) of objects
– Perceptually relevant
Supervised learning vs.
unsupervised learning
• Supervised learning: discover patterns in the
data that relate data attributes with a target
(class) attribute.
– These patterns are then utilized to predict the
values of the target attribute in future data
instances.
• Unsupervised learning: The data have no
target attribute.
– We want to explore the data to find some intrinsic
structures in them.
Clustering
• Clustering is a technique for finding similarity groups in
data, called clusters. I.e.,
– it groups data instances that are similar to (near) each other in
one cluster and data instances that are very different (far away)
from each other into different clusters.
• Clustering is often called an unsupervised learning task as
no class values denoting an a priori grouping of the data
instances are given, which is the case in supervised
learning.
• Due to historical reasons, clustering is often considered
synonymous with unsupervised learning.
What is a natural grouping of these objects?

Clustering is subjective

Example: the same people can be grouped as Simpson's family vs. school employees, or as females vs. males.


Clustering
• Basic idea: group together similar instances
• Example: 2D point patterns

• What could “similar” mean?


– One option: small Euclidean distance (squared)
dist(x, y) = ||x − y||₂²
– Clustering results are crucially dependent on the measure of
similarity (or distance) between “points” to be clustered
What is Similarity?
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of
possible objects. The distance (dissimilarity) between O1 and
O2 is a real number denoted by D(O1,O2)

Example: the distance between “Peter” and “Piotr” could be 0.23, 3, or 342.7, depending on the distance measure chosen.
Clustering examples
Image segmentation
Goal: Break up the image into meaningful or
perceptually similar regions
Clustering gene expression data
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups

Intra-cluster distances are minimized
Inter-cluster distances are maximized
Different Clustering
• Partitioning Clustering: K-means, K-medoids, PAM
• Fuzzy Clustering: FCM
• Hierarchical Clustering: AGNES, DIANA
• Density-Based Clustering: DBSCAN, Mean-shift
• OPTICS (Ordering Points To Identify the Clustering
Structure)
• Kernelized Clustering
• Probabilistic clustering
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-
valued space X ⊆ Rr, and r is the number of
attributes (dimensions) in the data.
• The k-means algorithm partitions the given
data into k clusters.
– Each cluster has a cluster center, called centroid.
– k is specified by the user
K-means Algorithm 1:
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise goto 3.
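The five steps translate almost line for line into NumPy. The sketch below is only an illustration of the algorithm as described above (function and variable names are chosen here, not taken from the lecture); it also reports the SSE that is used later as a stopping/quality criterion.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means on an (n, r) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 2: initialize centers
    for _ in range(max_iter):
        # step 3: assign every object to the nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 4: re-estimate each centroid as the mean of its members
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # step 5: stop when no centroid (hence no membership) changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = dists[np.arange(len(X)), labels].sum()    # sum of squared errors of the assignment
    return labels, centroids, sse

labels, centroids, sse = kmeans(np.random.default_rng(1).random((200, 2)), k=3)
```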
K-means Clustering: Steps 1–5
Algorithm: k-means, Distance Metric: Euclidean Distance
(Figures: Step 1 places the initial centroids k1, k2, k3; Step 2 assigns each
point to its nearest centroid; Steps 3–4 re-estimate the centroids and
re-assign the points; Step 5 shows the final, stable clusters.)
Mathematical Perspective
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional
real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S =
{S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance).
Formally, the objective is to find:

arg min over S of  Σi=1..k Σx∈Si ||x − μi||²

where μi is the mean (also called centroid) of the points in Si, i.e.

μi = (1/|Si|) Σx∈Si x,

|Si| is the size of Si, and ||·|| is the usual L2 norm
Clustering criteria

1. Similarity/distance function

2. Stopping criterion

3. Cluster Quality
1. Distance functions for numeric attributes

• Most commonly used functions are


– Euclidean distance and
– Manhattan (city block) distance
• We denote distance with: dist(xi, xj), where xi
and xj are data points (vectors)
• They are special cases of the Minkowski distance, where h is a
positive integer:
dist(xi, xj) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xir − xjr|^h)^(1/h)
Euclidean distance and Manhattan distance
• If h = 2, it is the Euclidean distance

• If h = 1, it is the Manhattan distance

• Weighted Euclidean distance
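A small sketch of the Minkowski family described above (the helper names are chosen here for illustration; the weighted variant simply scales each attribute's contribution):

```python
import numpy as np

def minkowski(x, y, h):
    """Minkowski distance: (sum_f |x_f - y_f|^h)^(1/h)."""
    return float((np.abs(np.asarray(x) - np.asarray(y)) ** h).sum() ** (1.0 / h))

def weighted_euclidean(x, y, w):
    """Euclidean distance with per-attribute weights w_f."""
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt((np.asarray(w) * diff ** 2).sum()))

x, y = [0.0, 3.0], [4.0, 0.0]
print(minkowski(x, y, 1))                      # h = 1: Manhattan distance -> 7.0
print(minkowski(x, y, 2))                      # h = 2: Euclidean distance -> 5.0
print(weighted_euclidean(x, y, [1.0, 0.25]))   # down-weighting the second attribute
```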


Squared distance and Chebychev distance
• Squared Euclidean distance: to place
progressively greater weight on data points
that are further apart.

• Chebychev distance: used when one wants to define two
data points as "different" if they differ on any one of the
attributes.
What properties should a distance measure
have?

• D(A,B) = D(B,A)                Symmetry

• D(A,A) = 0                     Constancy of Self-Similarity
• D(A,B) = 0 iff A = B           Positivity (Separation)
• D(A,B) ≤ D(A,C) + D(B,C)       Triangle Inequality
Intuitions behind desirable distance
measure properties

D(A,B) = D(B,A)
Otherwise you could claim “Alex looks like Bob, but Bob looks nothing like
Alex.”

D(A,A) = 0
Otherwise you could claim “Alex looks more like Bob, than Bob does.”
Distance functions for binary and
nominal attributes
• Binary attribute: has two values or states but
no ordering relationships, e.g.,
– Gender: male and female.
– Weather: rain, sunny
• We use a confusion matrix to introduce the
distance functions/measures.
• Let the ith and jth data points be xi and xj
(vectors)
Confusion matrix
Symmetric binary attributes
• A binary attribute is symmetric if both of its
states (0 and 1) have equal importance, and
carry the same weights, e.g., male and female
of the attribute Gender
• Distance function: Simple Matching
Coefficient, proportion of mismatches of their
values
Symmetric binary attributes: example
Asymmetric binary attributes
• Asymmetric: if one of the states is more
important or more valuable than the other.
– By convention, state 1 represents the more
important state, which is typically the rare or
infrequent state.
– Jaccard coefficient is a popular measure

– We can have some variations, adding weights
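A small sketch of the two measures on 0/1 vectors. Here the simple matching distance counts all mismatches, while the Jaccard distance ignores positions where both values are 0 (the usual convention for asymmetric attributes); the function names and example vectors are illustrative.

```python
def simple_matching_dist(x, y):
    """Proportion of attributes on which x and y disagree (symmetric binary)."""
    mismatches = sum(a != b for a, b in zip(x, y))
    return mismatches / len(x)

def jaccard_dist(x, y):
    """Mismatches over attributes where at least one value is 1 (asymmetric binary)."""
    mismatches = sum(a != b for a, b in zip(x, y))
    relevant = sum((a == 1) or (b == 1) for a, b in zip(x, y))
    return mismatches / relevant if relevant else 0.0

xi = [1, 0, 1, 1, 0, 0]
xj = [1, 1, 0, 1, 0, 0]
print(simple_matching_dist(xi, xj))  # 2 mismatches out of 6 attributes -> 0.333...
print(jaccard_dist(xi, xj))          # 2 mismatches out of 4 relevant attributes -> 0.5
```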


Nominal attributes
• Nominal attributes: with more than two states
or values.
– the commonly used distance measure is also
based on the simple matching method.
– Given two data points xi and xj, let the number of
attributes be r, and the number of values that
match in xi and xj be q; then dist(xi, xj) = (r − q) / r.
Normalization

Technique to force the attributes to have a common value range


🞄 What is the need?
Consider the following pair of data points
xi: (0.1, 20) and xj: (0.9, 720).

dist(xi, xj) = sqrt((0.9 − 0.1)² + (720 − 20)²) = 700.000457,

The distance is almost completely dominated by (720-20) = 700.

Standardize attributes: to force the attributes to have a common


value range
Interval-scaled attributes
• Their values are real numbers following a
linear scale.
– The difference in Age between 10 and 20 is the
same as that between 40 and 50.
– The key idea is that intervals keep the same
importance through out the scale
• Two main approaches to standardize interval
scaled attributes, range and z-score. f is an
attribute
Interval-scaled attributes (cont …)
• Z-score: transforms the attribute values so that they have
a mean of zero and a mean absolute deviation of 1. The
mean absolute deviation of attribute f, denoted by sf, is
computed as follows:

sf = (1/n)(|x1f − mf| + |x2f − mf| + … + |xnf − mf|), where mf is the mean of attribute f

Z-score:  z(xif) = (xif − mf) / sf
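Both standardizations in a few lines (a sketch; note that the z-score here divides by the mean absolute deviation, as defined on the slide, rather than the standard deviation):

```python
import numpy as np

def range_scale(col):
    """Range (min-max) normalization to [0, 1]."""
    col = np.asarray(col, dtype=float)
    return (col - col.min()) / (col.max() - col.min())

def z_score_mad(col):
    """Z-score using the mean absolute deviation s_f, as defined above."""
    col = np.asarray(col, dtype=float)
    m_f = col.mean()
    s_f = np.abs(col - m_f).mean()
    return (col - m_f) / s_f

attr = [20, 720, 100, 300]
print(range_scale(attr))
print(z_score_mad(attr))
```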
Is normalization desirable?
Other distance/similarity measures

• Minkowski metric: distance between two instances x and x′, where q >= 1 is a
selectable parameter and d is the number of attributes:

d(x, x′) = (Σk=1..d |xk − x′k|^q)^(1/q)

• Cosine similarity: the cosine of the angle between two vectors (instances) x and x′
gives a similarity function. When features are binary, this becomes the number of
attributes shared by x and x′ divided by the geometric mean of the number of
attributes in x and the number in x′.
2. Stopping criteria
1. no (or minimum) re-assignments of data
points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared
error (SSE),
SSE = Σj=1..k Σx∈Cj dist(x, mj)²

🞄 Cj is the jth cluster, mj is the centroid of cluster Cj (the
mean vector of all the data points in Cj), and dist(x,
mj) is the distance between data point x and centroid
mj.
An example
(Figure: data points with two cluster centroids marked “+”.)
An example (cont …)
Two different K-means Clusterings
(Figure: the same set of original points clustered in two ways,
giving an optimal clustering and a sub-optimal clustering.)


Importance of Choosing Initial Centroids (Case i)
(Figures: iterations 1–6 of k-means from one choice of initial centroids.)
Importance of Choosing Initial Centroids (Case ii)
(Figures: iterations 1–5 of k-means from a different choice of initial centroids.)
Problems with Selecting Initial Points

• If there are K ‘real’ clusters then the chance of selecting one centroid from
each cluster is small.
– Chance is relatively small when K is large

– If clusters are the same size, n, then the probability of selecting one
centroid from each cluster is K! n^K / (Kn)^K = K!/K^K

– For example, if K = 10, then probability = 10!/10^10 = 0.00036

– Sometimes the initial centroids will readjust themselves in ‘right’ way, and
sometimes they don’t

– Consider an example of five pairs of clusters

• Initial centers from different clusters may produce good clusters


Solutions to Initial Centroids Problem
• Multiple runs
– Helps, but probability is not on your side

• Sample and use hierarchical clustering to determine initial centroids

• Select more than k initial centroids and then select among these initial centroids
– Select most widely separated

• Post-processing

• Bisecting K-means
– Not as susceptible to initialization issues
Pre-processing and Post-processing
• Pre-processing
– Normalize the data

– Eliminate outliers

• Post-processing
– Eliminate small clusters that may represent outliers

– Split ‘loose’ clusters, i.e., clusters with relatively high SSE

– Merge clusters that are ‘close’ and that have relatively low SSE

– Can use these steps during the clustering process


• ISODATA
Bisecting K-means

• Bisecting K-means algorithm


– Variant of K-means that can produce a partitional or a
hierarchical clustering
Bisecting K-means Example
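The slides only name the variant, so the sketch below follows the common bisecting strategy: repeatedly pick the cluster with the largest SSE and split it with ordinary 2-means. scikit-learn's KMeans is used for the split; the function name and the SSE-based choice of cluster are assumptions for illustration, not the lecture's reference procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    """Repeatedly split the cluster with the largest SSE using ordinary 2-means."""
    clusters = [np.arange(len(X))]                  # start with one cluster of all points
    while len(clusters) < k:
        # pick the splittable cluster (>= 2 points) with the largest within-cluster SSE
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() if len(idx) > 1 else -1.0
               for idx in clusters]
        worst = clusters.pop(int(np.argmax(sse)))
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[worst])
        clusters.append(worst[labels == 0])
        clusters.append(worst[labels == 1])
    return clusters                                 # list of index arrays, one per cluster
```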
Limitations of K-means
• K-means has problems when clusters are of differing
– Sizes

– Densities

– Non-globular shapes

• K-means has problems when the data contains outliers.


Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)


Overcoming K-means Limitations

Original Points K-means


Clusters
One solution is to use many clusters.
– This finds parts of the natural clusters, which then need to be put together.
Limitations of K-means: Outlier Problem

• The k-means algorithm is sensitive to outliers !

– Since an object with an extremely large value may substantially distort the

distribution of the data.

• Solution: Instead of taking the mean value of the objects in a cluster as a reference point,

a medoid can be used, which is the most centrally located object in the cluster.

(Figures: two scatter plots on a 0–10 scale illustrating the effect of an outlier on
the cluster representative.)
K-medoid Algorithm:
1. Decide on a value for k.
2. Initialize the k cluster centers as actual objects in the data (medoids).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center (medoid).
4. Re-estimate the k cluster centers by choosing, as the new
medoid, the most centrally located object of each cluster.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise goto 3.
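A minimal sketch of the assign/re-estimate loop above. It uses Manhattan distance (as in the PAM example that follows) and updates each center to the most centrally located member of its cluster; the metric choice and names are assumptions for illustration, not the lecture's reference code.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Alternating k-medoids on an (n, r) array X: centers are always actual data objects."""
    rng = np.random.default_rng(seed)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)   # pairwise Manhattan distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)                # assign to the nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # new medoid = the member with the smallest total distance to the others
                new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if set(new_medoids) == set(medoids):                 # no medoid changed: converged
            break
        medoids = new_medoids
    return medoids, labels
```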
K-medoid Clustering: Steps 1–5
Algorithm: k-medoids, Distance Metric: Euclidean Distance
(Figures: medoids k1, k2, k3 are chosen from among the data objects, the
remaining points are assigned to their nearest medoid, and the medoids are
updated over successive iterations until the memberships stop changing.)
The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters

• PAM (Partitioning Around Medoids, 1987)

– starts from an initial set of medoids and iteratively replaces one of the medoids by

one of the non-medoids if it improves the total distance of the resulting clustering.

– PAM works effectively for small data sets, but does not scale well for large data sets.

• CLARA (Kaufmann & Rousseeuw, 1990)

• CLARANS (Ng & Han, 1994): Randomized sampling


PAM (Partitioning Around Medoids)

• Use real objects to represent the clusters (called medoids)

1. Select k representative objects arbitrarily

2. For each pair of selected object (i) and non-selected object (h), calculate the

Total swapping Cost (TCih)

3. For each pair of i and h,

1. If TCih < 0, i is replaced by h

2. Then assign each non-selected object to the most similar representative object

4. repeat steps 2-3 until there is no change in the medoids or in TCih.


Total swapping Cost (TCih)

Total swapping cost TCih = Σj Cjih

– Where Cjih is the cost of swapping i with h for all non-medoid objects j

– Cjih will vary depending upon different cases


Sum of Squared Error (SSE)
SSE (E) = Σi=1..k Σj∈Ci d²(j, i)

(Figure: points x1, x2, x3, … in cluster C1 and y1, y2, y3, … in cluster C2.)

SSE (E) = d²(x1, c1) + d²(x2, c1) + … + d²(xn1, c1) +
          d²(y1, c2) + d²(y2, c2) + … + d²(yn2, c2)
PAM or K-Medoids: Example

Data Objects (attributes A1, A2):

O1 (2, 6)    O6 (6, 4)
O2 (3, 4)    O7 (7, 3)
O3 (3, 8)    O8 (7, 4)
O4 (4, 7)    O9 (8, 5)
O5 (6, 2)    O10 (7, 6)

(Figure: scatter plot of the ten objects.)

Goal: create two clusters

Choose randomly two medoids: O8 = (7, 4) and O2 = (3, 4)
PAM or K-Medoids: Example

(Data objects and scatter plot as above, with cluster 1 and cluster 2 marked.)

Assign each object to the closest representative object.

Using the L1 metric (Manhattan distance), we form the following clusters:

Cluster1 = {O1, O2, O3, O4}

Cluster2 = {O5, O6, O7, O8, O9, O10}
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

Compute the absolute error criterion [for the set of medoids (O2, O8)]:

E = Σi=1..k Σp∈Ci |p − Oi| = |O1 − O2| + |O3 − O2| + |O4 − O2| +
    |O5 − O8| + |O6 − O8| + |O7 − O8| + |O9 − O8| + |O10 − O8|
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

The absolute error criterion [for the set of medoids (O2, O8)]:

E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

• Choose a random non-medoid object, O7
• Swap O8 and O7
• Compute the absolute error criterion [for the set of medoids (O2, O7)]:

E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

→ Compute the cost function:

S = Absolute error [O2, O7] − Absolute error [O2, O8] = 22 − 20 = 2

S > 0 => It is a bad idea to replace O8 by O7
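The two error values and the swap decision can be checked with a few lines of Python, taking the object coordinates from the table above and the Manhattan (L1) metric used in the example:

```python
objects = {'O1': (2, 6), 'O2': (3, 4), 'O3': (3, 8), 'O4': (4, 7), 'O5': (6, 2),
           'O6': (6, 4), 'O7': (7, 3), 'O8': (7, 4), 'O9': (8, 5), 'O10': (7, 6)}

def l1(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def absolute_error(medoids):
    """Sum, over all non-medoid objects, of the L1 distance to the closest medoid."""
    return sum(min(l1(p, objects[m]) for m in medoids)
               for name, p in objects.items() if name not in medoids)

e_28 = absolute_error({'O2', 'O8'})   # 20
e_27 = absolute_error({'O2', 'O7'})   # 22
print(e_28, e_27, e_27 - e_28)        # 20 22 2 -> S > 0, so keep O8
```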
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

C687 = d(O7, O6) − d(O8, O6) = 2 − 1 = 1
C587 = d(O7, O5) − d(O8, O5) = 2 − 3 = −1
C187 = 0, C387 = 0, C487 = 0
C987 = d(O7, O9) − d(O8, O9) = 3 − 2 = 1
C1087 = d(O7, O10) − d(O8, O10) = 3 − 2 = 1

TCih = Σj Cjih = 1 − 1 + 1 + 1 = 2 = S
What Is the Problem with PAM?

• PAM is more robust than k-means in the presence of noise and outliers because a

medoid is less influenced by outliers or other extreme values than a mean.

• PAM works efficiently for small data sets but does not scale well for large data

sets.

– O(k(n−k)²) for each iteration, where n is the number of data points and k is the number of clusters.

➔ Sampling based method, CLARA (Clustering LARge Applications)


CLARA (Clustering Large Applications)

• CLARA (Kaufmann and Rousseeuw in 1990)

• It draws multiple samples of the data set, applies PAM on each sample, and gives

the best clustering as the output.

• Strength: deals with larger data sets than PAM.

• Weakness:

– Efficiency depends on the sample size.

– A good clustering based on samples will not necessarily represent a good

clustering of the whole data set if the sample is biased.


CLARA (Clustering Large Applications)

• CLARA draws a sample of the dataset and applies PAM on the sample in order

to find the medoids.

• If the sample is a good representative of the entire dataset, then the medoids of the

sample should approximate the medoids of the entire dataset.

• Medoids are chosen from the sample.

– Note that the algorithm cannot find the best solution if one of the best k-

medoids is not among the selected sample.


CLARA (Clustering Large Applications)

(Diagram: draw a sample of the dataset and run PAM on the sample to obtain the medoids.)
CLARA (Clustering Large Applications)

(Diagram: sample1, sample2, …, samplem are each clustered with PAM, and the
best clustering is chosen.)

❖ To improve the approximation, multiple samples are drawn and the best
clustering is returned as the output.

❖ The clustering accuracy is measured by the average dissimilarity of all
objects in the entire dataset.

❖ Experiments show that 5 samples of size 40 + 2k give satisfactory results
CLARANS (“Randomized” CLARA)

• CLARANS (A Clustering Algorithm based on Randomized Search) (Ng


and Han’94).

• CLARANS draws a sample of neighbors dynamically.

• Randomly select k objects in the data as the current medoids.

• Randomly select a current medoid x and a non-medoid object y.

• If replacing x by y improves the SSE, then make the replacement.

• Do this l times; after l attempts the medoids are a local optimum.

• Perform the randomized search m times and return the best local
optimum as the final result.
CLARANS Properties

• Advantages

– Experiments show that CLARANS is more effective than both PAM and CLARA
– Handles outliers

• Disadvantages
– The computational complexity of CLARANS is O(n²), where n is the number of objects

– The clustering quality depends on the sampling method


Note: A sequence of partitions is called "hierarchical" if each cluster
in a given partition is the union of clusters in the next larger partition.
(Figure: partitions P4, P3, P2, P1.
Top: a hierarchical sequence of partitions.
Bottom: a non-hierarchical sequence.)
We begin with a distance
matrix which contains the
distances between every pair
of objects in our database.

0  8  8  7  7
   0  2  4  4
      0  3  3
         0  1
            0

e.g., D( , ) = 8 and D( , ) = 1 are entries of this matrix.
Bottom-Up (agglomerative):
Starting with each item in its own
cluster, find the best pair to merge into
a new cluster. Repeat until all clusters
are fused together.

(Figures: at each step, consider all possible merges and choose the best one,
repeated until everything is merged.)

But how do we compute distances
between clusters rather than
between objects?
Bottom-up (agglomerative)

• Have a distance measure on pairs of objects.

• d(x, y): distance between x and y

• Single linkage:   dist(A, B) = min over x∈A, x′∈B of d(x, x′)

• Complete linkage: dist(A, B) = max over x∈A, x′∈B of d(x, x′)

• Average linkage:  dist(A, B) = (1 / (|A| |B|)) Σx∈A, x′∈B d(x, x′)
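The three cluster-to-cluster distances in code form (a sketch on raw point sets, using Euclidean distance between objects; the function names and the toy clusters are illustrative):

```python
import numpy as np

def pairwise(A, B):
    """All Euclidean distances d(x, x') with x in A and x' in B."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def single_link(A, B):   return pairwise(A, B).min()    # closest pair
def complete_link(A, B): return pairwise(A, B).max()    # farthest pair
def average_link(A, B):  return pairwise(A, B).mean()   # sum / (|A| * |B|)

A = [(0, 0), (1, 0)]
B = [(4, 0), (5, 0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B))  # 3.0 5.0 4.0
```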
Computing distance
between clusters: Single

Link
cluster distance = distance of two
closest members in each class

– Potentially long and skinny clusters
Example: single link
Distance matrix:

      1    2    3    4    5
1     0
2     2    0
3     6    3    0
4    10    9    7    0
5     9    8    5    4    0

(Dendrogram being built over objects 1–5.)
Example: single link

After merging {1, 2}:

        (1,2)   3    4    5
(1,2)     0
3         3     0
4         9     7    0
5         8     5    4    0

d(1,2),3 = min{d1,3, d2,3} = min{6, 3} = 3
d(1,2),4 = min{d1,4, d2,4} = min{10, 9} = 9
d(1,2),5 = min{d1,5, d2,5} = min{9, 8} = 8
Example: single link

After merging {1, 2} with {3}:

          (1,2,3)   4    5
(1,2,3)      0
4            7      0
5            5      4    0

d(1,2,3),4 = min{d(1,2),4, d3,4} = min{9, 7} = 7
d(1,2,3),5 = min{d(1,2),5, d3,5} = min{8, 5} = 5
Example: single link

The smallest remaining distance is d4,5 = 4, so clusters {4} and {5} are merged
next. Finally, the two remaining clusters are joined:

d(1,2,3),(4,5) = min{d(1,2,3),4, d(1,2,3),5} = min{7, 5} = 5
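The same merge sequence can be reproduced with SciPy by feeding it the example's distance matrix in condensed form:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# symmetric distance matrix for objects 1..5 from the example
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  3,  9,  8],
              [ 6,  3,  0,  7,  5],
              [10,  9,  7,  0,  4],
              [ 9,  8,  5,  4,  0]], dtype=float)

Z = linkage(squareform(D), method='single')
print(Z)
# Objects are labelled 0..4 here (1..5 in the example). Merges: {0,1} at height 2,
# object 2 joins at height 3, {3,4} merge at height 4, and the final merge is at
# height 5, matching the hand computation above.
```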
Computing distance
between clusters:
Complete Link
• cluster distance = distance of two farthest
members

+ tight clusters
Computing distance
between clusters:
Average Link
• cluster distance = average distance of all
pairs

– The most widely used measure

– Robust against noise
Single linkage vs. Average linkage

(Dendrograms: the height at which objects/clusters are joined represents the
distance between them.)
Hierarchical divisive clustering
There are divisive versions of single linkage, average linkage, and
Ward’s method.
Divisive version of single linkage:
• Compute the minimal spanning tree (the graph connecting all the objects
with smallest total edge length).
• Break longest edge to obtain 2 subtrees, and a corresponding
partition of the objects.
• Apply process recursively to the subtrees.
Agglomerative and divisive versions of single linkage give identical
results (more later).
Divisive version of Ward’s method.
Given cluster R.
Need to find split of R into 2 groups P,Q to minimize

RSS(P, Q) = Σi∈P ||xi − x̄P||² + Σj∈Q ||xj − x̄Q||²

or, equivalently, to maximize Ward’s distance between P and Q.

Note: No computationally feasible method to find optimal P, Q for


large |R|. Have to use approximation.
Iterative algorithm to search for the optimal Ward’s split
Project observations in R on largest principal component.
Split at median to obtain initial clusters P, Q.
Repeat {
Assign each observation to cluster with closest mean
Re-compute cluster means
} Until convergence

Note:
• Each step reduces RSS(P, Q)
• No guarantee to find optimal partition.
Fuzzy Set

•Fuzzy sets are sets whose elements have degrees of


membership.
•A fuzzy set is a pair (A, m) where A is a set and m : A → [0, 1]
–For each x ∈ A, m(x) is called the grade of
membership of x in (A, m). For a finite set A =
{x1,...,xn}, the fuzzy set (A, m) is often denoted
by {m(x1) / x1, ..., m(xn) / xn}.
–m(x) = 0 : x is not included in (A, m)
–m(x) = 1 : x is fully included in (A, m)
Fuzzy C-Means Clustering

• Fuzzy c-means (FCM) is a method of clustering


which allows one piece of data to belong to
two or more clusters
• It is frequently used in pattern recognition.
Fuzzy C-Means Clustering

• Based on minimization of the following objective function:

Jm = Σi=1..N Σj=1..C uij^m ||xi − cj||²

• m is any real number greater than 1

• uij is the degree of membership of xi in cluster j
• xi is the i-th of the d-dimensional measured data points
• cj is the d-dimensional center of cluster j
• ||·|| is any norm expressing the similarity between any measured
data point and the center
FCM algorithm
• The algorithm is composed of the following steps
1. Initialize U=[uij] matrix, U(0)
2. At k-step: calculate the centers vectors C(k)=[cj] with
U(k)
FCM algorithm

• The algorithm is composed of the following steps


3. Update U(k) to U(k+1)

4. If ||U(k+1) − U(k)|| < ε (i.e., maxij |uij(k+1) − uij(k)| < ε),

then STOP; otherwise return to step 2.
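Steps 2 and 3 refer to the standard FCM update formulas (centers as membership-weighted means, memberships from inverse distance ratios). A minimal sketch under that assumption, with illustrative names:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=200, seed=0):
    """Minimal FCM on an (n, d) array X with c clusters and fuzzifier m > 1."""
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                   # step 1: random fuzzy memberships
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # step 2: c_j = sum u_ij^m x_i / sum u_ij^m
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # step 3: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < eps:               # step 4: stopping criterion
            return centers, U_new
        U = U_new
    return centers, U
```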
FCM advantages

• Gives good results for overlapping data sets, and is
comparatively better than the k-means algorithm.
• Unlike k-means, where a data point must belong
exclusively to one cluster center, here each data point is
assigned a membership to every cluster center, so a data
point may belong to more than one cluster.
FCM disadvantages

• A priori specification of the number of clusters.

• With a lower value of ε we get a better result, but at the
expense of more iterations.
• Euclidean distance measures can unequally weight
underlying factors.
1. DBSCAN
2. DENCLUE
Density-Based Clustering
Methods
• Clustering based on density (local cluster criterion), such as
density-connected points or based on an explicitly
constructed density function
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– DENCLUE: Hinneburg & D. Keim (KDD’98/2006)
– OPTICS: Ankerst, et al (SIGMOD’99).
– CLIQUE: Agrawal, et al. (SIGMOD’98)
DBSCAN

• DBSCAN is a density-based algorithm.


– Density = number of points within a specified radius r (Eps)

– A point is a core point if it has more than a specified number of points


(MinPts) within Eps
• These are points that are at the interior of a cluster

– A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point

– A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
DBSCAN Algorithm (simplified view for teaching)
1. Create a graph whose nodes are the points to be clustered
2. For each core point c create an edge from c to every point p in
the Eps-neighborhood of c
3. Set N to the nodes of the graph;
4. If N does not contain any core points terminate
5. Pick a core point c in N
6. Let X be the set of nodes that can be reached from c by going
forward;
1. create a cluster containing X ∪ {c}
2. N = N \ (X ∪ {c})
7. Continue with step 4
Remark: points that are not assigned to any cluster are outliers
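In practice this is what scikit-learn's DBSCAN implements; eps and min_samples play the roles of Eps and MinPts above. A minimal usage sketch (the two-moons data and the parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two crescent-shaped clusters
db = DBSCAN(eps=0.2, min_samples=4).fit(X)

labels = db.labels_                        # cluster index per point; -1 marks noise/outliers
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True  # True for core points
print(set(labels), core_mask.sum())        # e.g. two clusters (0 and 1), possibly some noise
```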
DBSCAN: Core, Border and Noise Points

Original Points Point types: core, border


and noise

Eps = 10, MinPts = 4


When DBSCAN Works Well

Original Points Clusters

• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well

(MinPts=4, Eps=9.75).

Original Points

• Varying densities
• High-dimensional data
(MinPts=4, Eps=9.92)
Complexity DBSCAN
• Time Complexity: O(n²), since for each point it
has to be determined if it is a core point;
can be reduced to O(n log n) in lower-
dimensional spaces by using efficient data
structures (n is the number of objects to be
clustered);
• Space Complexity: O(n).
Summary DBSCAN
• Good: can detect arbitrary shapes, not very
sensitive to noise, supports outlier detection,
complexity is kind of okay; besides K-means, it is
the second most used clustering algorithm.
• Bad: does not work well on high-dimensional
datasets, parameter selection is tricky, it has
problems identifying clusters of varying
densities (→ SNN algorithm), and its density estimation
is kind of simplistic (it does not create a real
density function, but rather a graph of density-
connected points)
DBSCAN Algorithm Revisited
• Eliminate noise points
• Perform clustering on the remaining points:
DENCLUE

• DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)


• Major features
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
– Significant faster than existing algorithm (faster than DBSCAN
by a factor of up to 45) ????????
– But needs a large number of parameters
Denclue: Technical Essence
• Uses grid cells but only keeps information about grid cells that do actually
contain data points and manages these cells in a tree-based access
structure.
• Influence function: describes the impact of a data point within its
neighborhood.
• Overall density of the data space can be calculated as the sum of the
influence function of all data points.
• Clusters can be determined using hill climbing by identifying density
attractors; density attractors are local maxima of the overall density
function.
• Objects that are associated with the same density attractor belong to the
same cluster.
Gradient: The steepness of a slope

• Example (Gaussian influence function):

fGaussian(x, y) = exp(−d(x, y)² / (2σ²))

fDGaussian(x) = Σi=1..N exp(−d(x, xi)² / (2σ²))

∇fDGaussian(x, xi) = Σi=1..N (xi − x) · exp(−d(x, xi)² / (2σ²))
Example: Density Computation

D = {x1, x2, x3, x4}

fDGaussian(x) = influence(x, x1) + influence(x, x2) + influence(x, x3)

+ influence(x, x4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78

(Figure: point x receiving influences 0.04, 0.06, 0.08, 0.6 from x1, x2, x3, x4,
with a nearby point y.)

Remark: the density value of y would be larger than the one for x
Density Attractor
Examples of DENCLUE Clusters
Basic Steps DENCLUE Algorithms

1. Determine density attractors


2. Associate data objects with density
attractors using hill climbing
3. Possibly, merge the initial clusters further
relying on a hierarchical clustering approach
(optional; not covered in this lecture)
Cluster evaluation: ground truth

🞄We use some labeled data (for classification)


🞄Assumption: Each class is a cluster.
🞄 After clustering, a confusion matrix is constructed. From
the matrix, we compute various measures:
entropy, purity, precision, recall and F-score.
🞄Let the classes in the data D be C = (c1, c2, …, ck). The clustering
method produces k clusters, which divides D into k disjoint
subsets, D1, D2, …, Dk.
Confusion Matrix

                                 Predicted condition
Total population = P + N         Positive (PP)          Negative (PN)

Actual      Positive (P)         True positive (TP)     False negative (FN)
condition   Negative (N)         False positive (FP)    True negative (TN)
Measures

Sensitivity = TP / (TP + FN) = TPR (True Positive Rate)

Accuracy = (TP + TN) / (TP + FP + TN + FN)


Evaluation measures: Entropy
Evaluation measures: purity
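The two headings above are usually backed by the standard definitions: the entropy of each cluster is computed from the class proportions inside it, purity is the fraction of its dominant class, and both are weighted by cluster size. A sketch under that assumption (the confusion-matrix numbers are illustrative):

```python
import numpy as np

def entropy_and_purity(counts):
    """counts[i][j] = number of objects of class j that ended up in cluster i."""
    counts = np.asarray(counts, dtype=float)
    cluster_sizes = counts.sum(axis=1)
    weights = cluster_sizes / counts.sum()          # cluster size fractions |Di| / |D|
    p = counts / cluster_sizes[:, None]             # class proportions within each cluster
    with np.errstate(divide='ignore', invalid='ignore'):
        h = -np.nansum(np.where(p > 0, p * np.log2(p), 0.0), axis=1)   # entropy per cluster
    purity = p.max(axis=1)                          # dominant-class fraction per cluster
    return (weights * h).sum(), (weights * purity).sum()

conf = [[45,  5,  0],    # 3 clusters x 3 classes (illustrative numbers)
        [ 2, 40,  8],
        [ 0,  3, 47]]
print(entropy_and_purity(conf))
```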
Learning of Patterns?

– Supervised learning: a teacher provides a category label


or cost for each pattern in the training set.

➡Classification

–Unsupervised learning: the system forms clusters or


natural groupings of the input patterns (based on some
similarity criteria).

➡Clustering
Supervised Training/Learning
– a “teacher” provides labeled training sets, used
to train a classifier
Classifier
A classifier partitions the sample space X into class-labeled regions such that
X = X1 ∪ X2 ∪ … ∪ X|Y| and Xi ∩ Xj = {} for i ≠ j

(Figure: a feature space partitioned into regions labeled X1, X2, X3.)

Classification consists of determining to which region a feature vector x belongs.
Borders between decision regions are called decision boundaries.
Components of classifier system

Pattern → Sensors and preprocessing → Feature extraction → Classifier → Class assignment

(A teacher and a learning algorithm feed the classifier during training.)

• Sensors and preprocessing.


• A feature extraction aims to create discriminative features good for classification.
• A classifier.
• A teacher provides information about hidden state -- supervised learning.
• A learning algorithm sets up the classifier from training examples.
Basic concepts
Pattern
Feature vector x ∈ X
- A vector of observations (measurements): x = (x1, x2, …, xn)ᵀ
- x is a point in the feature space X.

Hidden state y ∈ Y
- Cannot be directly measured.
- Patterns with equal hidden state belong to the same class.

Task
- To design a classifier (decision rule) q : X → Y
which decides about a hidden state based on an observation.
Training Samples

When a few labeled patterns can be collected by experts, supervised

learning can be adopted instead of the unsupervised approach, to make full
use of the labeled patterns.

From a dataset, a training set is used to build the model.


Example
Task: American-Indian recognition.

The set of hidden states is Y = {A, I}
The feature space is X = R²
x = (x1, x2)ᵀ = (height, weight)

Training examples {(x1, y1), …, (xl, yl)}

Linear classifier:
q(x) = A if (w · x) + b ≥ 0
       I if (w · x) + b < 0

(Figure: the decision boundary (w · x) + b = 0 separating the regions y = A and
y = I in the (x1, x2) plane.)
Minimum Distance Classifier
• Each class is represented by its
mean vector
• Training is performed by
calculating the mean of the
feature vectors of each class
• New patterns are classified by
finding the closest mean vector
• The boundary is the
perpendicular bisector of the
line joining the mean points.

(Figure: a test pattern in the (X1, X2) plane together with the class1 and class2
mean vectors.)
Minimum Distance Classifier

mj = (1 / Nj) Σ over X ∈ ωj of X,   j = 1, 2, …, M

Nj = number of pixels from class ωj

Dj(X) = ||X − mj||   (Euclidean distance)

Assign X to ωi if Di(X) < Dj(X) for all j = 1, 2, …, M; j ≠ i
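The decision rule above in a few lines of NumPy (a sketch; the class labels and toy data are illustrative):

```python
import numpy as np

def fit_class_means(X, y):
    """m_j = mean feature vector of each class."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def predict_min_distance(X, classes, means):
    """Assign each sample to the class whose mean is closest (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)   # D_j(X) = ||X - m_j||
    return classes[d.argmin(axis=1)]

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.2], [4.8, 5.0]])
y = np.array(['A', 'A', 'B', 'B'])
classes, means = fit_class_means(X, y)
print(predict_min_distance(np.array([[1.1, 0.9], [5.0, 4.9]]), classes, means))  # ['A' 'B']
```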
Introduction
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be
used
for both classification as well as regression predictive problems.

❖ However, it is mainly used for classification predictive problems in industry.

There are three categories of learning algorithms:

1. Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a
specialized training phase or model and uses all the data for training while classification.

2. Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm


because it doesn’t assume anything about the underlying data.

3. Eager learning algorithm - Eager learners, when given a set of training tuples, will
construct a generalization model before receiving new (e.g., test) tuples to classify.
Lazy learners

•‘Lazy’: Do not create a model of the training instances in advance

•When an instance arrives for testing, runs the algorithm to get the class
prediction

•Example, K – nearest neighbour classifier (K – NN classifier)

“One is known by the company one keeps”


Different Learning Methods
• Eager Learning
(Cartoon: “I saw a mouse!”; any random movement => it’s a mouse)

• Instance-based Learning
(Cartoon: “It’s very similar to a desktop!”)
Pattern-Based Classification: Nearest
Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably
a duck
(Diagram: for a test record, compute the distance to the training records and
choose the k “nearest” records.)
K-NN classifier schematic

For a test instance,


1) Calculate distances from training pts.
2) Find K-nearest neighbours (say, K = 3)
3) Assign class label based on majority
KNN Algorithm:
K-nearest neighbors (KNN) algorithm training data with the help of any of
uses ‘feature similarity’ to predict the the method namely: Euclidean,
values of new data points which Manhattan or Hamming distance.
further means that the new data The most commonly used method to
point will be assigned a value based calculate distance is Euclidean.
on how closely it matches the points 3.2 − Now, based on the distance
in the training set. We can value,
understand its working with the help sort them in ascending order.
of following steps −
Step 1 − For implementing any 3.3 − Next, it will choose the top K
algorithm, we need dataset. So rows
during the first step of KNN, we must from the sorted array.
load the training as well as test data. 3.4 − Now, it will assign a class to
Step 2 − Next, we need to choose the the test
value of K i.e. the nearest data points. point based on most
K can be any integer. frequent class
Step 3 − For each point in the test of these row
data do the following − Step 4 − End
3.1 − Calculate the distance
between test data and each row of
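Steps 1–4 translate directly into a short function. This is a sketch using Euclidean distance and a majority vote; the function name and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, K=3):
    """Classify x_test by majority vote among its K nearest training points."""
    d = np.linalg.norm(X_train - x_test, axis=1)   # step 3.1: Euclidean distances
    nearest = np.argsort(d)[:K]                    # steps 3.2-3.3: K closest rows
    votes = Counter(y_train[nearest])              # step 3.4: most frequent class
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array(['red', 'red', 'red', 'blue', 'blue', 'blue'])
print(knn_predict(X_train, y_train, np.array([2, 2]), K=3))   # 'red'
```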
1-Nearest Neighbor
3-Nearest Neighbor
Example-1:
• The following is an example to understand the
concept of K and working of KNN algorithm
• Suppose we have a dataset which can be plotted
as follows:
Example-1 (Contd.)
• Now, we need to classify a new
data point, shown as a black dot (at
point (60, 60)), into the blue or red
class. We are assuming K = 3,
i.e. it would find the three nearest
data points, as shown in the
following diagram:

• We can see in the diagram the
three nearest neighbors of the
data point marked with the black
dot. Among those three, two of
them lie in the red class, hence
the black dot will also be assigned
to the red class.
Advantages
1. No Training Period: KNN is called Lazy Learner (Instance based
learning). It does not learn anything in the training period. It does
not derive any discriminative function from the training data. It
stores the training dataset and learns from it only at the time of
making real time predictions. This makes the KNN algorithm
much faster than other algorithms that require training e.g.
Linear Regression etc.
2. Since the KNN algorithm requires no training before making
predictions, new data can be added seamlessly which will not
impact the accuracy of the algorithm.
3. KNN is very easy to implement. There are only two
parameters required to implement KNN i.e. the value of K and
the distance function (e.g. Euclidean or Manhattan etc.)
Disadvantages
1. Does not work well with large dataset: In large datasets, the cost of
calculating the distance between the new point and each existing points is
huge which degrades the performance of the algorithm.
2. Does not work well with high dimensions: The KNN algorithm doesn't work
well with high dimensional data because with large number of dimensions, it
becomes difficult for the algorithm to calculate the distance in each dimension.
3. Need feature scaling: We need to do feature scaling (standardization and
normalization) before applying KNN algorithm to any dataset. If we don't do so,
KNN may generate wrong predictions.
4. Sensitive to noisy data, missing values and outliers: KNN is sensitive to noise
in the dataset. We need to manually impute missing values and remove outliers.
K-NN classifier Issues

How to determine distances between values of categorical attributes?

Alternatives:
1. Boolean distance (1 if same, 0 if different)
2. Differential grading (e.g. weather: ‘drizzling’ and ‘rainy’ are
closer than ‘rainy’ and ‘sunny’)

How to determine the value of K?

Alternatives:
1. Determine K experimentally. The K that gives minimum
error is selected.
