
Pattern Recognition: Clustering

Badri Narayan Subudhi


Indian Institute of Technology Jammu
NH44, Nagrota, Jagti, Jammu
Subudhi.badri@gmail.com
What is Pattern Recognition?

• Pattern Recognition is the study of how machines can:


– observe the environment,
– learn to distinguish patterns of interest,
– make sound and reasonable decisions about the
categories of the patterns.

• What is a pattern?
• What kinds of categories do we have?
What is a pattern?

• A pattern is an abstraction, represented by a set of measurements
describing a “physical” object.
• Many types of patterns exist:
– visual, temporal, sonic, logical, ...
What is a Pattern Class (or category)?

– A set of patterns sharing common attributes

– A collection of “similar”, not necessarily identical, objects


What are Features?

• Features are properties of an object:

– Ideally representative of a specific type (i.e., class) of objects
– Perceptually relevant
Supervised learning vs.
unsupervised learning
• Supervised learning: discover patterns in the
data that relate data attributes with a target
(class) attribute.
– These patterns are then utilized to predict the
values of the target attribute in future data
instances.
• Unsupervised learning: The data have no
target attribute.
– We want to explore the data to find some intrinsic
structures in them.
Clustering
• Clustering is a technique for finding similarity groups in
data, called clusters. I.e.,
– it groups data instances that are similar to (near) each other in
one cluster and data instances that are very different (far away)
from each other into different clusters.
• Clustering is often called an unsupervised learning task as
no class values denoting an a priori grouping of the data
instances are given, which is the case in supervised
learning.
• Due to historical reasons, clustering is often considered
synonymous with unsupervised learning.
What is a natural grouping of these objects?

Clustering is subjective

Example: the same people can be grouped as Simpson's family vs. school employees, or as females vs. males.


Clustering
• Basic idea: group together similar instances
• Example: 2D point patterns

• What could “similar” mean?


– One option: small Euclidean distance (squared)
dist(x, y) = ||x − y||₂²
– Clustering results are crucially dependent on the measure of
similarity (or distance) between “points” to be clustered
What is Similarity?
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of
possible objects. The distance (dissimilarity) between O1 and
O2 is a real number denoted by D(O1,O2)

Example: the distance between “Peter” and “Piotr” could be 0.23, 3, or 342.7, depending on the distance measure chosen.
Clustering examples
Image segmentation
Goal: Break up the image into meaningful or
perceptually similar regions
Clustering gene expression data
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups

Intra-cluster distances are minimized
Inter-cluster distances are maximized
Different Clustering
• Partitioning Clustering: K-means, K-medoids, PAM
• Fuzzy Clustering: FCM
• Hierarchical Clustering: AGNES, DIANA
• Density-Based Clustering: DBSCAN, Mean-shift
• OPTICS (Ordering Points To Identify the Clustering
Structure)
• Kernelized Clustering
• Probabilistic clustering
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-
valued space X ⊆ Rr, and r is the number of
attributes (dimensions) in the data.
• The k-means algorithm partitions the given
data into k clusters.
– Each cluster has a cluster center, called centroid.
– k is specified by the user
K-means Algorithm 1:
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise goto 3.
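The five steps translate almost line for line into NumPy. The sketch below is only an illustration of the algorithm as described above (function and variable names are chosen here, not taken from the lecture); it also reports the SSE that is used later as a stopping/quality criterion.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means on an (n, r) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 2: initialize centers
    for _ in range(max_iter):
        # step 3: assign every object to the nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 4: re-estimate each centroid as the mean of its members
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # step 5: stop when no centroid (hence no membership) changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = dists[np.arange(len(X)), labels].sum()    # sum of squared errors of the assignment
    return labels, centroids, sse

labels, centroids, sse = kmeans(np.random.default_rng(1).random((200, 2)), k=3)
```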
K-means Clustering: Steps 1–5
Algorithm: k-means, Distance Metric: Euclidean Distance
(Figures: Step 1 places the initial centroids k1, k2, k3; Step 2 assigns each
point to its nearest centroid; Steps 3–4 re-estimate the centroids and
re-assign the points; Step 5 shows the final, stable clusters.)
Mathematical Perspective
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional
real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S =
{S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance).
Formally, the objective is to find:

arg min over S of  Σi=1..k Σx∈Si ||x − μi||²

where μi is the mean (also called centroid) of the points in Si, i.e.

μi = (1/|Si|) Σx∈Si x,

|Si| is the size of Si, and ||·|| is the usual L2 norm
Clustering criteria

1. Similarity/distance function

2. Stopping criterion

3. Cluster Quality
1. Distance functions for numeric attributes

• Most commonly used functions are


– Euclidean distance and
– Manhattan (city block) distance
• We denote distance with: dist(xi, xj), where xi
and xj are data points (vectors)
• They are special cases of the Minkowski distance, where h is a
positive integer:
dist(xi, xj) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xir − xjr|^h)^(1/h)
Euclidean distance and Manhattan distance
• If h = 2, it is the Euclidean distance

• If h = 1, it is the Manhattan distance

• Weighted Euclidean distance
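A small sketch of the Minkowski family described above (the helper names are chosen here for illustration; the weighted variant simply scales each attribute's contribution):

```python
import numpy as np

def minkowski(x, y, h):
    """Minkowski distance: (sum_f |x_f - y_f|^h)^(1/h)."""
    return float((np.abs(np.asarray(x) - np.asarray(y)) ** h).sum() ** (1.0 / h))

def weighted_euclidean(x, y, w):
    """Euclidean distance with per-attribute weights w_f."""
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt((np.asarray(w) * diff ** 2).sum()))

x, y = [0.0, 3.0], [4.0, 0.0]
print(minkowski(x, y, 1))                      # h = 1: Manhattan distance -> 7.0
print(minkowski(x, y, 2))                      # h = 2: Euclidean distance -> 5.0
print(weighted_euclidean(x, y, [1.0, 0.25]))   # down-weighting the second attribute
```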


Squared distance and Chebychev distance
• Squared Euclidean distance: to place
progressively greater weight on data points
that are further apart.

• Chebychev distance: used when one wants to define two
data points as "different" if they differ on any one of the
attributes.
What properties should a distance measure
have?

• D(A,B) = D(B,A)                Symmetry

• D(A,A) = 0                     Constancy of Self-Similarity
• D(A,B) = 0 iff A = B           Positivity (Separation)
• D(A,B) ≤ D(A,C) + D(B,C)       Triangle Inequality
Intuitions behind desirable distance
measure properties

D(A,B) = D(B,A)
Otherwise you could claim “Alex looks like Bob, but Bob looks nothing like
Alex.”

D(A,A) = 0
Otherwise you could claim “Alex looks more like Bob, than Bob does.”
Distance functions for binary and
nominal attributes
• Binary attribute: has two values or states but
no ordering relationships, e.g.,
– Gender: male and female.
– Weather: rain, sunny
• We use a confusion matrix to introduce the
distance functions/measures.
• Let the ith and jth data points be xi and xj
(vectors)
Confusion matrix
Symmetric binary attributes
• A binary attribute is symmetric if both of its
states (0 and 1) have equal importance, and
carry the same weights, e.g., male and female
of the attribute Gender
• Distance function: Simple Matching
Coefficient, proportion of mismatches of their
values
Symmetric binary attributes: example
Asymmetric binary attributes
• Asymmetric: if one of the states is more
important or more valuable than the other.
– By convention, state 1 represents the more
important state, which is typically the rare or
infrequent state.
– Jaccard coefficient is a popular measure

– We can have some variations, adding weights
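A small sketch of the two measures on 0/1 vectors. Here the simple matching distance counts all mismatches, while the Jaccard distance ignores positions where both values are 0 (the usual convention for asymmetric attributes); the function names and example vectors are illustrative.

```python
def simple_matching_dist(x, y):
    """Proportion of attributes on which x and y disagree (symmetric binary)."""
    mismatches = sum(a != b for a, b in zip(x, y))
    return mismatches / len(x)

def jaccard_dist(x, y):
    """Mismatches over attributes where at least one value is 1 (asymmetric binary)."""
    mismatches = sum(a != b for a, b in zip(x, y))
    relevant = sum((a == 1) or (b == 1) for a, b in zip(x, y))
    return mismatches / relevant if relevant else 0.0

xi = [1, 0, 1, 1, 0, 0]
xj = [1, 1, 0, 1, 0, 0]
print(simple_matching_dist(xi, xj))  # 2 mismatches out of 6 attributes -> 0.333...
print(jaccard_dist(xi, xj))          # 2 mismatches out of 4 relevant attributes -> 0.5
```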


Nominal attributes
• Nominal attributes: with more than two states
or values.
– the commonly used distance measure is also
based on the simple matching method.
– Given two data points xi and xj, let the number of
attributes be r, and the number of values that
match in xi and xj be q; then dist(xi, xj) = (r − q) / r.
Normalization

Technique to force the attributes to have a common value range


🞄 What is the need?
Consider the following pair of data points
xi: (0.1, 20) and xj: (0.9, 720).

dist(xi, xj) = sqrt((0.9 − 0.1)² + (720 − 20)²) = 700.000457,

The distance is almost completely dominated by (720-20) = 700.

Standardize attributes: to force the attributes to have a common


value range
Interval-scaled attributes
• Their values are real numbers following a
linear scale.
– The difference in Age between 10 and 20 is the
same as that between 40 and 50.
– The key idea is that intervals keep the same
importance through out the scale
• Two main approaches to standardize interval
scaled attributes, range and z-score. f is an
attribute
Interval-scaled attributes (cont …)
• Z-score: transforms the attribute values so that they have
a mean of zero and a mean absolute deviation of 1. The
mean absolute deviation of attribute f, denoted by sf, is
computed as follows:

sf = (1/n)(|x1f − mf| + |x2f − mf| + … + |xnf − mf|), where mf is the mean of attribute f

Z-score:  z(xif) = (xif − mf) / sf
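Both standardizations in a few lines (a sketch; note that the z-score here divides by the mean absolute deviation, as defined on the slide, rather than the standard deviation):

```python
import numpy as np

def range_scale(col):
    """Range (min-max) normalization to [0, 1]."""
    col = np.asarray(col, dtype=float)
    return (col - col.min()) / (col.max() - col.min())

def z_score_mad(col):
    """Z-score using the mean absolute deviation s_f, as defined above."""
    col = np.asarray(col, dtype=float)
    m_f = col.mean()
    s_f = np.abs(col - m_f).mean()
    return (col - m_f) / s_f

attr = [20, 720, 100, 300]
print(range_scale(attr))
print(z_score_mad(attr))
```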
Is normalization desirable?
Other distance/similarity measures

• Minkowski metric: distance between two instances x and x′, where q >= 1 is a
selectable parameter and d is the number of attributes:

d(x, x′) = (Σk=1..d |xk − x′k|^q)^(1/q)

• Cosine similarity: the cosine of the angle between two vectors (instances) x and x′
gives a similarity function. When features are binary, this becomes the number of
attributes shared by x and x′ divided by the geometric mean of the number of
attributes in x and the number in x′.
2. Stopping criteria
1. no (or minimum) re-assignments of data
points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared
error (SSE),
SSE = Σj=1..k Σx∈Cj dist(x, mj)²

🞄 Cj is the jth cluster, mj is the centroid of cluster Cj (the
mean vector of all the data points in Cj), and dist(x,
mj) is the distance between data point x and centroid
mj.
An example
(Figure: data points with two cluster centroids marked “+”.)
An example (cont …)
Two different K-means Clusterings
(Figure: the same set of original points clustered in two ways,
giving an optimal clustering and a sub-optimal clustering.)


Importance of Choosing Initial Centroids (Case i)
(Figures: iterations 1–6 of k-means from one choice of initial centroids.)
Importance of Choosing Initial Centroids (Case ii)
(Figures: iterations 1–5 of k-means from a different choice of initial centroids.)
Problems with Selecting Initial Points

• If there are K ‘real’ clusters then the chance of selecting one centroid from
each cluster is small.
– Chance is relatively small when K is large

– If clusters are the same size, n, then the probability of selecting one
centroid from each cluster is K! n^K / (Kn)^K = K!/K^K

– For example, if K = 10, then probability = 10!/10^10 = 0.00036

– Sometimes the initial centroids will readjust themselves in ‘right’ way, and
sometimes they don’t

– Consider an example of five pairs of clusters

• Initial centers from different clusters may produce good clusters


Solutions to Initial Centroids Problem
• Multiple runs
– Helps, but probability is not on your side

• Sample and use hierarchical clustering to determine initial centroids

• Select more than k initial centroids and then select among these initial centroids
– Select most widely separated

• Post-processing

• Bisecting K-means
– Not as susceptible to initialization issues
Pre-processing and Post-processing
• Pre-processing
– Normalize the data

– Eliminate outliers

• Post-processing
– Eliminate small clusters that may represent outliers

– Split ‘loose’ clusters, i.e., clusters with relatively high SSE

– Merge clusters that are ‘close’ and that have relatively low SSE

– Can use these steps during the clustering process


• ISODATA
Bisecting K-means

• Bisecting K-means algorithm


– Variant of K-means that can produce a partitional or a
hierarchical clustering
Bisecting K-means Example
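The slides only name the variant, so the sketch below follows the common bisecting strategy: repeatedly pick the cluster with the largest SSE and split it with ordinary 2-means. scikit-learn's KMeans is used for the split; the function name and the SSE-based choice of cluster are assumptions for illustration, not the lecture's reference procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    """Repeatedly split the cluster with the largest SSE using ordinary 2-means."""
    clusters = [np.arange(len(X))]                  # start with one cluster of all points
    while len(clusters) < k:
        # pick the splittable cluster (>= 2 points) with the largest within-cluster SSE
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() if len(idx) > 1 else -1.0
               for idx in clusters]
        worst = clusters.pop(int(np.argmax(sse)))
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[worst])
        clusters.append(worst[labels == 0])
        clusters.append(worst[labels == 1])
    return clusters                                 # list of index arrays, one per cluster
```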
Limitations of K-means
• K-means has problems when clusters are of differing
– Sizes

– Densities

– Non-globular shapes

• K-means has problems when the data contains outliers.


Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)


Overcoming K-means Limitations

Original Points K-means


Clusters
One solution is to use many clusters.
– This finds parts of the natural clusters, which then need to be put together.
Limitations of K-means: Outlier Problem

• The k-means algorithm is sensitive to outliers !

– Since an object with an extremely large value may substantially distort the

distribution of the data.

• Solution: Instead of taking the mean value of the objects in a cluster as a reference point,

a medoid can be used, which is the most centrally located object in the cluster.

(Figures: two scatter plots on a 0–10 scale illustrating the effect of an outlier on
the cluster representative.)
K-medoid Algorithm:
1. Decide on a value for k.
2. Initialize the k cluster centers as actual objects in the data (medoids).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center (medoid).
4. Re-estimate the k cluster centers by choosing, as the new
medoid, the most centrally located object of each cluster.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise goto 3.
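A minimal sketch of the assign/re-estimate loop above. It uses Manhattan distance (as in the PAM example that follows) and updates each center to the most centrally located member of its cluster; the metric choice and names are assumptions for illustration, not the lecture's reference code.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Alternating k-medoids on an (n, r) array X: centers are always actual data objects."""
    rng = np.random.default_rng(seed)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)   # pairwise Manhattan distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)                # assign to the nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # new medoid = the member with the smallest total distance to the others
                new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if set(new_medoids) == set(medoids):                 # no medoid changed: converged
            break
        medoids = new_medoids
    return medoids, labels
```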
K-medoid Clustering: Steps 1–5
Algorithm: k-medoids, Distance Metric: Euclidean Distance
(Figures: medoids k1, k2, k3 are chosen from among the data objects, the
remaining points are assigned to their nearest medoid, and the medoids are
updated over successive iterations until the memberships stop changing.)
The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters

• PAM (Partitioning Around Medoids, 1987)

– starts from an initial set of medoids and iteratively replaces one of the medoids by

one of the non-medoids if it improves the total distance of the resulting clustering.

– PAM works effectively for small data sets, but does not scale well for large data sets.

• CLARA (Kaufmann & Rousseeuw, 1990)

• CLARANS (Ng & Han, 1994): Randomized sampling


PAM (Partitioning Around Medoids)

• Use real objects to represent the clusters (called medoids)

1. Select k representative objects arbitrarily

2. For each pair of selected object (i) and non-selected object (h), calculate the

Total swapping Cost (TCih)

3. For each pair of i and h,

1. If TCih < 0, i is replaced by h

2. Then assign each non-selected object to the most similar representative object

4. repeat steps 2-3 until there is no change in the medoids or in TCih.


Total swapping Cost (TCih)

Total swapping cost TCih = Σj Cjih

– Where Cjih is the cost of swapping i with h for all non-medoid objects j

– Cjih will vary depending upon different cases


Sum of Squared Error (SSE)
SSE (E) = Σi=1..k Σj∈Ci d²(j, i)

(Figure: points x1, x2, x3, … in cluster C1 and y1, y2, y3, … in cluster C2.)

SSE (E) = d²(x1, c1) + d²(x2, c1) + … + d²(xn1, c1) +
          d²(y1, c2) + d²(y2, c2) + … + d²(yn2, c2)
PAM or K-Medoids: Example

Data Objects (attributes A1, A2):

O1 (2, 6)    O6 (6, 4)
O2 (3, 4)    O7 (7, 3)
O3 (3, 8)    O8 (7, 4)
O4 (4, 7)    O9 (8, 5)
O5 (6, 2)    O10 (7, 6)

(Figure: scatter plot of the ten objects.)

Goal: create two clusters

Choose randomly two medoids: O8 = (7, 4) and O2 = (3, 4)
PAM or K-Medoids: Example

(Data objects and scatter plot as above, with cluster 1 and cluster 2 marked.)

Assign each object to the closest representative object.

Using the L1 metric (Manhattan distance), we form the following clusters:

Cluster1 = {O1, O2, O3, O4}

Cluster2 = {O5, O6, O7, O8, O9, O10}
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

Compute the absolute error criterion [for the set of medoids (O2, O8)]:

E = Σi=1..k Σp∈Ci |p − Oi| = |O1 − O2| + |O3 − O2| + |O4 − O2| +
    |O5 − O8| + |O6 − O8| + |O7 − O8| + |O9 − O8| + |O10 − O8|
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

The absolute error criterion [for the set of medoids (O2, O8)]:

E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

• Choose a random non-medoid object, O7
• Swap O8 and O7
• Compute the absolute error criterion [for the set of medoids (O2, O7)]:

E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

→ Compute the cost function:

S = Absolute error [O2, O7] − Absolute error [O2, O8] = 22 − 20 = 2

S > 0 => It is a bad idea to replace O8 by O7
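The two error values and the swap decision can be checked with a few lines of Python, taking the object coordinates from the table above and the Manhattan (L1) metric used in the example:

```python
objects = {'O1': (2, 6), 'O2': (3, 4), 'O3': (3, 8), 'O4': (4, 7), 'O5': (6, 2),
           'O6': (6, 4), 'O7': (7, 3), 'O8': (7, 4), 'O9': (8, 5), 'O10': (7, 6)}

def l1(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def absolute_error(medoids):
    """Sum, over all non-medoid objects, of the L1 distance to the closest medoid."""
    return sum(min(l1(p, objects[m]) for m in medoids)
               for name, p in objects.items() if name not in medoids)

e_28 = absolute_error({'O2', 'O8'})   # 20
e_27 = absolute_error({'O2', 'O7'})   # 22
print(e_28, e_27, e_27 - e_28)        # 20 22 2 -> S > 0, so keep O8
```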
PAM or K-Medoids: Example

(Data objects and scatter plot as above.)

C687 = d(O7, O6) − d(O8, O6) = 2 − 1 = 1
C587 = d(O7, O5) − d(O8, O5) = 2 − 3 = −1
C187 = 0, C387 = 0, C487 = 0
C987 = d(O7, O9) − d(O8, O9) = 3 − 2 = 1
C1087 = d(O7, O10) − d(O8, O10) = 3 − 2 = 1

TCih = Σj Cjih = 1 − 1 + 1 + 1 = 2 = S
What Is the Problem with PAM?

• PAM is more robust than k-means in the presence of noise and outliers because a

medoid is less influenced by outliers or other extreme values than a mean.

• PAM works efficiently for small data sets but does not scale well for large data

sets.

– O(k(n−k)²) for each iteration, where n is the number of data points and k is the number of clusters.

➔ Sampling based method, CLARA (Clustering LARge Applications)


CLARA (Clustering Large Applications)

• CLARA (Kaufmann and Rousseeuw in 1990)

• It draws multiple samples of the data set, applies PAM on each sample, and gives

the best clustering as the output.

• Strength: deals with larger data sets than PAM.

• Weakness:

– Efficiency depends on the sample size.

– A good clustering based on samples will not necessarily represent a good

clustering of the whole data set if the sample is biased.


CLARA (Clustering Large Applications)

• CLARA draws a sample of the dataset and applies PAM on the sample in order

to find the medoids.

• If the sample is a good representative of the entire dataset, then the medoids of the

sample should approximate the medoids of the entire dataset.

• Medoids are chosen from the sample.

– Note that the algorithm cannot find the best solution if one of the best k-

medoids is not among the selected sample.


CLARA (Clustering Large Applications)

(Diagram: draw a sample of the dataset and run PAM on the sample to obtain the medoids.)
CLARA (Clustering Large Applications)

(Diagram: sample1, sample2, …, samplem are each clustered with PAM, and the
best clustering is chosen.)

❖ To improve the approximation, multiple samples are drawn and the best
clustering is returned as the output.

❖ The clustering accuracy is measured by the average dissimilarity of all
objects in the entire dataset.

❖ Experiments show that 5 samples of size 40 + 2k give satisfactory results
CLARANS (“Randomized” CLARA)

• CLARANS (A Clustering Algorithm based on Randomized Search) (Ng


and Han’94).

• CLARANS draws a sample of neighbors dynamically.

• Randomly select k objects in the data as the current medoids.

• Randomly select a current medoid x and a non-medoid object y.

• If replacing x by y improves the SSE, then make the replacement.

• Do this l times; after l attempts the medoids are a local optimum.

• Perform the randomized search m times and return the best local
optimum as the final result.
CLARANS Properties

• Advantages

– Experiments show that CLARANS is more effective than both PAM and CLARA
– Handles outliers

• Disadvantages
– The computational complexity of CLARANS is O(n²), where n is the number of objects

– The clustering quality depends on the sampling method


Note: A sequence of partitions is called "hierarchical" if each cluster
in a given partition is the union of clusters in the next larger partition.
(Figure: partitions P4, P3, P2, P1.
Top: a hierarchical sequence of partitions.
Bottom: a non-hierarchical sequence.)
We begin with a distance
matrix which contains the
distances between every pair
of objects in our database.

0  8  8  7  7
   0  2  4  4
      0  3  3
         0  1
            0

e.g., D( , ) = 8 and D( , ) = 1 are entries of this matrix.
Bottom-Up (agglomerative):
Starting with each item in its own
cluster, find the best pair to merge into
a new cluster. Repeat until all clusters
are fused together.

(Figures: at each step, consider all possible merges and choose the best one,
repeated until everything is merged.)

But how do we compute distances
between clusters rather than
between objects?
Bottom-up (agglomerative)

• Have a distance measure on pairs of objects.

• d(x, y): distance between x and y

• Single linkage:   dist(A, B) = min over x∈A, x′∈B of d(x, x′)

• Complete linkage: dist(A, B) = max over x∈A, x′∈B of d(x, x′)

• Average linkage:  dist(A, B) = (1 / (|A| |B|)) Σx∈A, x′∈B d(x, x′)
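The three cluster-to-cluster distances in code form (a sketch on raw point sets, using Euclidean distance between objects; the function names and the toy clusters are illustrative):

```python
import numpy as np

def pairwise(A, B):
    """All Euclidean distances d(x, x') with x in A and x' in B."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def single_link(A, B):   return pairwise(A, B).min()    # closest pair
def complete_link(A, B): return pairwise(A, B).max()    # farthest pair
def average_link(A, B):  return pairwise(A, B).mean()   # sum / (|A| * |B|)

A = [(0, 0), (1, 0)]
B = [(4, 0), (5, 0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B))  # 3.0 5.0 4.0
```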
Computing distance
between clusters: Single

Link
cluster distance = distance of two
closest members in each class

– Potentially long and skinny clusters
Example: single link
Distance matrix:

      1    2    3    4    5
1     0
2     2    0
3     6    3    0
4    10    9    7    0
5     9    8    5    4    0

(Dendrogram being built over objects 1–5.)
Example: single link

After merging {1, 2}:

        (1,2)   3    4    5
(1,2)     0
3         3     0
4         9     7    0
5         8     5    4    0

d(1,2),3 = min{d1,3, d2,3} = min{6, 3} = 3
d(1,2),4 = min{d1,4, d2,4} = min{10, 9} = 9
d(1,2),5 = min{d1,5, d2,5} = min{9, 8} = 8
Example: single link

After merging {1, 2} with {3}:

          (1,2,3)   4    5
(1,2,3)      0
4            7      0
5            5      4    0

d(1,2,3),4 = min{d(1,2),4, d3,4} = min{9, 7} = 7
d(1,2,3),5 = min{d(1,2),5, d3,5} = min{8, 5} = 5
Example: single link

The smallest remaining distance is d4,5 = 4, so clusters {4} and {5} are merged
next. Finally, the two remaining clusters are joined:

d(1,2,3),(4,5) = min{d(1,2,3),4, d(1,2,3),5} = min{7, 5} = 5
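The same merge sequence can be reproduced with SciPy by feeding it the example's distance matrix in condensed form:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# symmetric distance matrix for objects 1..5 from the example
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  3,  9,  8],
              [ 6,  3,  0,  7,  5],
              [10,  9,  7,  0,  4],
              [ 9,  8,  5,  4,  0]], dtype=float)

Z = linkage(squareform(D), method='single')
print(Z)
# Objects are labelled 0..4 here (1..5 in the example). Merges: {0,1} at height 2,
# object 2 joins at height 3, {3,4} merge at height 4, and the final merge is at
# height 5, matching the hand computation above.
```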
Computing distance
between clusters:
Complete Link
• cluster distance = distance of two farthest
members

+ tight clusters
Computing distance
between clusters:
Average Link
• cluster distance = average distance of all
pairs

– The most widely used measure

– Robust against noise
Single linkage vs. Average linkage

(Dendrograms: the height at which objects/clusters are joined represents the
distance between them.)
Hierarchical divisive clustering
There are divisive versions of single linkage, average linkage, and
Ward’s method.
Divisive version of single linkage:
• Compute the minimal spanning tree (the graph connecting all the objects
with smallest total edge length).
• Break longest edge to obtain 2 subtrees, and a corresponding
partition of the objects.
• Apply process recursively to the subtrees.
Agglomerative and divisive versions of single linkage give identical
results (more later).
Divisive version of Ward’s method.
Given cluster R.
Need to find split of R into 2 groups P,Q to minimize

RSS(P, Q) = Σi∈P ||xi − x̄P||² + Σj∈Q ||xj − x̄Q||²

or, equivalently, to maximize Ward’s distance between P and Q.

Note: No computationally feasible method to find optimal P, Q for


large |R|. Have to use approximation.
Iterative algorithm to search for the optimal Ward’s split
Project observations in R on largest principal component.
Split at median to obtain initial clusters P, Q.
Repeat {
Assign each observation to cluster with closest mean
Re-compute cluster means
} Until convergence

Note:
• Each step reduces RSS(P, Q)
• No guarantee to find optimal partition.
Fuzzy Set

•Fuzzy sets are sets whose elements have degrees of


membership.
•A fuzzy set is a pair (A, m) where A is a set and m : A → [0, 1]
–For each x ∈ A, m(x) is called the grade of
membership of x in (A, m). For a finite set A =
{x1,...,xn}, the fuzzy set (A, m) is often denoted
by {m(x1) / x1, ..., m(xn) / xn}.
–m(x) = 0 : x is not included in (A, m)
–m(x) = 1 : x is fully included in (A, m)
Fuzzy C-Means Clustering

• Fuzzy c-means (FCM) is a method of clustering


which allows one piece of data to belong to
two or more clusters
• It is frequently used in pattern recognition.
Fuzzy C-Means Clustering

• Based on minimization of the following objective function:

Jm = Σi=1..N Σj=1..C uij^m ||xi − cj||²

• m is any real number greater than 1

• uij is the degree of membership of xi in cluster j
• xi is the i-th of the d-dimensional measured data points
• cj is the d-dimensional center of cluster j
• ||·|| is any norm expressing the similarity between any measured
data point and the center
FCM algorithm
• The algorithm is composed of the following steps
1. Initialize U=[uij] matrix, U(0)
2. At k-step: calculate the centers vectors C(k)=[cj] with
U(k)
FCM algorithm

• The algorithm is composed of the following steps


3. Update U(k) to U(k+1)

4. If ||U(k+1) − U(k)|| < ε (i.e., maxij |uij(k+1) − uij(k)| < ε),

then STOP; otherwise return to step 2.
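Steps 2 and 3 refer to the standard FCM update formulas (centers as membership-weighted means, memberships from inverse distance ratios). A minimal sketch under that assumption, with illustrative names:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=200, seed=0):
    """Minimal FCM on an (n, d) array X with c clusters and fuzzifier m > 1."""
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                   # step 1: random fuzzy memberships
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # step 2: c_j = sum u_ij^m x_i / sum u_ij^m
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # step 3: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < eps:               # step 4: stopping criterion
            return centers, U_new
        U = U_new
    return centers, U
```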
FCM advantages

• Gives good results for overlapping data sets, and is
comparatively better than the k-means algorithm.
• Unlike k-means, where a data point must belong
exclusively to one cluster center, here each data point is
assigned a membership to every cluster center, so a data
point may belong to more than one cluster.
FCM disadvantages

• A priori specification of the number of clusters.

• With a lower value of ε we get a better result, but at the
expense of more iterations.
• Euclidean distance measures can unequally weight
underlying factors.
1. DBSCAN
2. DENCLUE
Density-Based Clustering
Methods
• Clustering based on density (local cluster criterion), such as
density-connected points or based on an explicitly
constructed density function
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– DENCLUE: Hinneburg & D. Keim (KDD’98/2006)
– OPTICS: Ankerst, et al (SIGMOD’99).
– CLIQUE: Agrawal, et al. (SIGMOD’98)
DBSCAN

• DBSCAN is a density-based algorithm.


– Density = number of points within a specified radius r (Eps)

– A point is a core point if it has more than a specified number of points


(MinPts) within Eps
• These are points that are at the interior of a cluster

– A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point

– A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
DBSCAN Algorithm (simplified view for teaching)
1. Create a graph whose nodes are the points to be clustered
2. For each core point c create an edge from c to every point p in
the Eps-neighborhood of c
3. Set N to the nodes of the graph;
4. If N does not contain any core points terminate
5. Pick a core point c in N
6. Let X be the set of nodes that can be reached from c by going
forward;
1. create a cluster containing X ∪ {c}
2. N = N \ (X ∪ {c})
7. Continue with step 4
Remark: points that are not assigned to any cluster are outliers
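In practice this is what scikit-learn's DBSCAN implements; eps and min_samples play the roles of Eps and MinPts above. A minimal usage sketch (the two-moons data and the parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two crescent-shaped clusters
db = DBSCAN(eps=0.2, min_samples=4).fit(X)

labels = db.labels_                        # cluster index per point; -1 marks noise/outliers
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True  # True for core points
print(set(labels), core_mask.sum())        # e.g. two clusters (0 and 1), possibly some noise
```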
DBSCAN: Core, Border and Noise Points

Original Points Point types: core, border


and noise

Eps = 10, MinPts = 4


When DBSCAN Works Well

Original Points Clusters

• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well

(MinPts=4, Eps=9.75).

Original Points

• Varying densities
• High-dimensional data
(MinPts=4, Eps=9.92)
Complexity DBSCAN
• Time Complexity: O(n²), since for each point it
has to be determined if it is a core point;
can be reduced to O(n log n) in lower-
dimensional spaces by using efficient data
structures (n is the number of objects to be
clustered);
• Space Complexity: O(n).
Summary DBSCAN
• Good: can detect arbitrary shapes, not very
sensitive to noise, supports outlier detection,
complexity is kind of okay; besides K-means, it is
the second most used clustering algorithm.
• Bad: does not work well on high-dimensional
datasets, parameter selection is tricky, it has
problems identifying clusters of varying
densities (→ SNN algorithm), and its density estimation
is kind of simplistic (it does not create a real
density function, but rather a graph of density-
connected points)
DBSCAN Algorithm Revisited
• Eliminate noise points
• Perform clustering on the remaining points:
DENCLUE

• DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)


• Major features
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
– Significant faster than existing algorithm (faster than DBSCAN
by a factor of up to 45) ????????
– But needs a large number of parameters
Denclue: Technical Essence
• Uses grid cells but only keeps information about grid cells that do actually
contain data points and manages these cells in a tree-based access
structure.
• Influence function: describes the impact of a data point within its
neighborhood.
• Overall density of the data space can be calculated as the sum of the
influence function of all data points.
• Clusters can be determined using hill climbing by identifying density
attractors; density attractors are local maxima of the overall density
function.
• Objects that are associated with the same density attractor belong to the
same cluster.
Gradient: The steepness of a slope

• Example (Gaussian influence function):

fGaussian(x, y) = exp(−d(x, y)² / (2σ²))

fDGaussian(x) = Σi=1..N exp(−d(x, xi)² / (2σ²))

∇fDGaussian(x, xi) = Σi=1..N (xi − x) · exp(−d(x, xi)² / (2σ²))
Example: Density Computation

D = {x1, x2, x3, x4}

fDGaussian(x) = influence(x, x1) + influence(x, x2) + influence(x, x3)

+ influence(x, x4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78

(Figure: point x receiving influences 0.04, 0.06, 0.08, 0.6 from x1, x2, x3, x4,
with a nearby point y.)

Remark: the density value of y would be larger than the one for x
Density Attractor
Examples of DENCLUE Clusters
Basic Steps DENCLUE Algorithms

1. Determine density attractors


2. Associate data objects with density
attractors using hill climbing
3. Possibly, merge the initial clusters further
relying on a hierarchical clustering approach
(optional; not covered in this lecture)
Cluster evaluation: ground truth

🞄We use some labeled data (for classification)


🞄Assumption: Each class is a cluster.
🞄 After clustering, a confusion matrix is constructed. From
the matrix, we compute various measures:
entropy, purity, precision, recall and F-score.
🞄Let the classes in the data D be C = (c1, c2, …, ck). The clustering
method produces k clusters, which divides D into k disjoint
subsets, D1, D2, …, Dk.
Confusion Matrix

                                 Predicted condition
Total population = P + N         Positive (PP)          Negative (PN)

Actual      Positive (P)         True positive (TP)     False negative (FN)
condition   Negative (N)         False positive (FP)    True negative (TN)
Measures

Sensitivity = TP / (TP + FN) = TPR (True Positive Rate)

Accuracy = (TP + TN) / (TP + FP + TN + FN)


Evaluation measures: Entropy
Evaluation measures: purity
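The two headings above are usually backed by the standard definitions: the entropy of each cluster is computed from the class proportions inside it, purity is the fraction of its dominant class, and both are weighted by cluster size. A sketch under that assumption (the confusion-matrix numbers are illustrative):

```python
import numpy as np

def entropy_and_purity(counts):
    """counts[i][j] = number of objects of class j that ended up in cluster i."""
    counts = np.asarray(counts, dtype=float)
    cluster_sizes = counts.sum(axis=1)
    weights = cluster_sizes / counts.sum()          # cluster size fractions |Di| / |D|
    p = counts / cluster_sizes[:, None]             # class proportions within each cluster
    with np.errstate(divide='ignore', invalid='ignore'):
        h = -np.nansum(np.where(p > 0, p * np.log2(p), 0.0), axis=1)   # entropy per cluster
    purity = p.max(axis=1)                          # dominant-class fraction per cluster
    return (weights * h).sum(), (weights * purity).sum()

conf = [[45,  5,  0],    # 3 clusters x 3 classes (illustrative numbers)
        [ 2, 40,  8],
        [ 0,  3, 47]]
print(entropy_and_purity(conf))
```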
Learning of Patterns?

– Supervised learning: a teacher provides a category label


or cost for each pattern in the training set.

➡Classification

–Unsupervised learning: the system forms clusters or


natural groupings of the input patterns (based on some
similarity criteria).

➡Clustering
Supervised Training/Learning
– a “teacher” provides labeled training sets, used
to train a classifier
Classifier
A classifier partitions the sample space X into class-labeled regions such that
X = X1 ∪ X2 ∪ … ∪ X|Y| and Xi ∩ Xj = {} for i ≠ j

(Figure: a feature space partitioned into regions labeled X1, X2, X3.)

Classification consists of determining to which region a feature vector x belongs.
Borders between decision regions are called decision boundaries.
Components of classifier system

Pattern → Sensors and preprocessing → Feature extraction → Classifier → Class assignment

(A teacher and a learning algorithm feed the classifier during training.)

• Sensors and preprocessing.


• A feature extraction aims to create discriminative features good for classification.
• A classifier.
• A teacher provides information about hidden state -- supervised learning.
• A learning algorithm sets up the classifier from training examples.
Basic concepts
Pattern
Feature vector x ∈ X
- A vector of observations (measurements): x = (x1, x2, …, xn)ᵀ
- x is a point in the feature space X.

Hidden state y ∈ Y
- Cannot be directly measured.
- Patterns with equal hidden state belong to the same class.

Task
- To design a classifier (decision rule) q : X → Y
which decides about a hidden state based on an observation.
Training Samples

When a few labeled patterns can be collected by experts, supervised

learning can be adopted instead of the unsupervised approach, to make full
use of the labeled patterns.

From a dataset, a training set is used to build the model.


Example
Task: American-Indian recognition.

The set of hidden states is Y = {A, I}
The feature space is X = R²
x = (x1, x2)ᵀ = (height, weight)

Training examples {(x1, y1), …, (xl, yl)}

Linear classifier:
q(x) = A if (w · x) + b ≥ 0
       I if (w · x) + b < 0

(Figure: the decision boundary (w · x) + b = 0 separating the regions y = A and
y = I in the (x1, x2) plane.)
Minimum Distance Classifier
• Each class is represented by its
mean vector
• Training is performed by
calculating the mean of the
feature vectors of each class
• New patterns are classified by
finding the closest mean vector
• The boundary is the
perpendicular bisector of the
line joining the mean points.

(Figure: a test pattern in the (X1, X2) plane together with the class1 and class2
mean vectors.)
Minimum Distance Classifier

mj = (1 / Nj) Σ over X ∈ ωj of X,   j = 1, 2, …, M

Nj = number of pixels from class ωj

Dj(X) = ||X − mj||   (Euclidean distance)

Assign X to ωi if Di(X) < Dj(X) for all j = 1, 2, …, M; j ≠ i
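The decision rule above in a few lines of NumPy (a sketch; the class labels and toy data are illustrative):

```python
import numpy as np

def fit_class_means(X, y):
    """m_j = mean feature vector of each class."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def predict_min_distance(X, classes, means):
    """Assign each sample to the class whose mean is closest (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)   # D_j(X) = ||X - m_j||
    return classes[d.argmin(axis=1)]

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.2], [4.8, 5.0]])
y = np.array(['A', 'A', 'B', 'B'])
classes, means = fit_class_means(X, y)
print(predict_min_distance(np.array([[1.1, 0.9], [5.0, 4.9]]), classes, means))  # ['A' 'B']
```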
Introduction
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be
used
for both classification as well as regression predictive problems.

❖ However, it is mainly used for classification predictive problems in industry.

There are three categories of learning algorithms:

1. Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a
specialized training phase or model and uses all the data for training while classification.

2. Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm


because it doesn’t assume anything about the underlying data.

3. Eager learning algorithm - Eager learners, when given a set of training tuples, will
construct a generalization model before receiving new (e.g., test) tuples to classify.
Lazy learners

•‘Lazy’: Do not create a model of the training instances in advance

•When an instance arrives for testing, runs the algorithm to get the class
prediction

•Example, K – nearest neighbour classifier (K – NN classifier)

“One is known by the company one keeps”


Different Learning Methods
• Eager Learning
(Cartoon: “I saw a mouse!”; any random movement => it’s a mouse)

• Instance-based Learning
(Cartoon: “It’s very similar to a desktop!”)
Pattern-Based Classification: Nearest
Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably
a duck
(Diagram: for a test record, compute the distance to the training records and
choose the k “nearest” records.)
K-NN classifier schematic

For a test instance,


1) Calculate distances from training pts.
2) Find K-nearest neighbours (say, K = 3)
3) Assign class label based on majority
KNN Algorithm:
K-nearest neighbors (KNN) algorithm training data with the help of any of
uses ‘feature similarity’ to predict the the method namely: Euclidean,
values of new data points which Manhattan or Hamming distance.
further means that the new data The most commonly used method to
point will be assigned a value based calculate distance is Euclidean.
on how closely it matches the points 3.2 − Now, based on the distance
in the training set. We can value,
understand its working with the help sort them in ascending order.
of following steps −
Step 1 − For implementing any 3.3 − Next, it will choose the top K
algorithm, we need dataset. So rows
during the first step of KNN, we must from the sorted array.
load the training as well as test data. 3.4 − Now, it will assign a class to
Step 2 − Next, we need to choose the the test
value of K i.e. the nearest data points. point based on most
K can be any integer. frequent class
Step 3 − For each point in the test of these row
data do the following − Step 4 − End
3.1 − Calculate the distance
between test data and each row of
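Steps 1–4 translate directly into a short function. This is a sketch using Euclidean distance and a majority vote; the function name and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, K=3):
    """Classify x_test by majority vote among its K nearest training points."""
    d = np.linalg.norm(X_train - x_test, axis=1)   # step 3.1: Euclidean distances
    nearest = np.argsort(d)[:K]                    # steps 3.2-3.3: K closest rows
    votes = Counter(y_train[nearest])              # step 3.4: most frequent class
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array(['red', 'red', 'red', 'blue', 'blue', 'blue'])
print(knn_predict(X_train, y_train, np.array([2, 2]), K=3))   # 'red'
```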
1-Nearest Neighbor
3-Nearest Neighbor
Example-1:
• The following is an example to understand the
concept of K and working of KNN algorithm
• Suppose we have a dataset which can be plotted
as follows:
Example-1 (Contd.)
• Now, we need to classify a new
data point, shown as a black dot (at
point (60, 60)), into the blue or red
class. We are assuming K = 3,
i.e. it would find the three nearest
data points, as shown in the
following diagram:

• We can see in the diagram the
three nearest neighbors of the
data point marked with the black
dot. Among those three, two of
them lie in the red class, hence
the black dot will also be assigned
to the red class.
Advantages
1. No Training Period: KNN is called Lazy Learner (Instance based
learning). It does not learn anything in the training period. It does
not derive any discriminative function from the training data. It
stores the training dataset and learns from it only at the time of
making real time predictions. This makes the KNN algorithm
much faster than other algorithms that require training e.g.
Linear Regression etc.
2. Since the KNN algorithm requires no training before making
predictions, new data can be added seamlessly which will not
impact the accuracy of the algorithm.
3. KNN is very easy to implement. There are only two
parameters required to implement KNN i.e. the value of K and
the distance function (e.g. Euclidean or Manhattan etc.)
Disadvantages
1. Does not work well with large dataset: In large datasets, the cost of
calculating the distance between the new point and each existing points is
huge which degrades the performance of the algorithm.
2. Does not work well with high dimensions: The KNN algorithm doesn't work
well with high dimensional data because with large number of dimensions, it
becomes difficult for the algorithm to calculate the distance in each dimension.
3. Need feature scaling: We need to do feature scaling (standardization and
normalization) before applying KNN algorithm to any dataset. If we don't do so,
KNN may generate wrong predictions.
4. Sensitive to noisy data, missing values and outliers: KNN is sensitive to noise
in the dataset. We need to manually impute missing values and remove outliers.
K-NN classifier Issues

How to determine distances between values of categorical attributes?

Alternatives:
1. Boolean distance (1 if same, 0 if different)
2. Differential grading (e.g. weather: ‘drizzling’ and ‘rainy’ are
closer than ‘rainy’ and ‘sunny’)

How to determine the value of K?

Alternatives:
1. Determine K experimentally. The K that gives minimum
error is selected.
