• What is a pattern?
• What kinds of categories do we have?
What is a pattern?
Clustering is subjective
Clustering examples
Image segmentation
Goal: Break up the image into meaningful or
perceptually similar regions
Clustering gene expression data
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
Intra-cluster distances are minimized; inter-cluster distances are maximized.
Different Clustering Approaches
• Partitioning Clustering: K-means, K-medoids, PAM
• Fuzzy Clustering: FCM
• Hierarchical Clustering: AGNES, DIANA
• Density-Based Clustering: DBSCAN, Mean-shift
• OPTICS (Ordering Points To Identify the Clustering Structure)
• Kernelized Clustering
• Probabilistic Clustering
K-means clustering
• K-means is a partitional clustering algorithm.
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
  – Each cluster has a cluster center, called the centroid.
  – k is specified by the user.
K-means Algorithm 1:
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise goto 3.
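A minimal NumPy sketch of Algorithm 1 (the function name kmeans, the random-point initialization, and the toy data are illustrative choices, not part of the original slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means following Algorithm 1: assign, re-estimate, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize the k cluster centers with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no object changed membership in the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its current members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Example: 300 random 2-D points grouped into k=3 clusters.
centers, labels = kmeans(np.random.rand(300, 2), k=3)
```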
K-means Clustering: Steps 1–5
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figure sequence: five snapshots of k-means on 2-D points (axes 0–5) with centroids k1, k2, k3: initial centroid placement, assignment of points to the nearest centroid, re-estimation of the centroids, reassignment, and the final converged clustering.]
Mathematical Perspective
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional
real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S =
{S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance).
Formally, the objective is to find:
arg min_S Σ_{i=1..k} Σ_{x ∈ S_i} ‖x − μ_i‖²,  where μ_i = (1/|S_i|) Σ_{x ∈ S_i} x,
|S_i| is the size of S_i, and ‖·‖ is the usual L2 norm.
Clustering criteria
1. Similarity/distance function
2. Stopping criterion
3. Cluster quality
1. Distance functions for numeric attributes
D(A,B) = D(B,A)
Otherwise you could claim “Alex looks like Bob, but Bob looks nothing like
Alex.”
D(A,A) = 0
Otherwise you could claim “Alex looks more like Bob, than Bob does.”
Distance functions for binary and
nominal attributes
• Binary attribute: has two values or states but
no ordering relationships, e.g.,
– Gender: male and female.
– Weather: rain, sunny
• We use a confusion matrix to introduce the
distance functions/measures.
• Let the ith and jth data points be xi and xj
(vectors)
Confusion matrix
Symmetric binary attributes
• A binary attribute is symmetric if both of its
states (0 and 1) have equal importance, and
carry the same weights, e.g., male and female
of the attribute Gender
• Distance function: Simple Matching
Coefficient, proportion of mismatches of their
values
Symmetric binary attributes: example
Asymmetric binary attributes
• Asymmetric: if one of the states is more
important or more valuable than the other.
– By convention, state 1 represents the more
important state, which is typically the rare or
infrequent state.
– Jaccard coefficient is a popular measure
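As a small sketch of these two measures (standard definitions, computed from the four counts of the 2×2 confusion matrix of two binary vectors; the function name and example vectors are made up for illustration):

```python
import numpy as np

def binary_distances(xi, xj):
    """Simple matching and Jaccard distances between two binary vectors."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    a = np.sum((xi == 1) & (xj == 1))  # matches on 1
    b = np.sum((xi == 1) & (xj == 0))  # mismatches: 1 in xi, 0 in xj
    c = np.sum((xi == 0) & (xj == 1))  # mismatches: 0 in xi, 1 in xj
    d = np.sum((xi == 0) & (xj == 0))  # matches on 0
    smc = (b + c) / (a + b + c + d)    # symmetric attributes: all mismatches count
    jaccard = (b + c) / (a + b + c)    # asymmetric attributes: 0/0 matches ignored
    return smc, jaccard

# Example: two binary purchase vectors.
print(binary_distances([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # (0.4, 0.5)
```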
Z-score:
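For reference, the usual z-score transformation (a standard definition, not reconstructed from the missing slide content) rescales an attribute value x to z = (x − μ) / σ, where μ and σ are the mean and standard deviation of that attribute over the data set, so that all attributes contribute on a comparable scale.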
Is normalization desirable?
Other distance/similarity measures
Two different K-means Clusterings
[Figure: the same set of original points clustered in two different ways by k-means.]
Importance of Choosing Initial Centroids (Case i)
[Figure: iterations 1–6 of k-means from one choice of initial centroids.]
Importance of Choosing Initial Centroids (Case ii)
[Figure: iterations 1–5 of k-means from a different choice of initial centroids.]
Problems with Selecting Initial Points
• If there are K ‘real’ clusters then the chance of selecting one centroid from
each cluster is small.
– Chance is relatively small when K is large
– Sometimes the initial centroids will readjust themselves in ‘right’ way, and
sometimes they don’t
• Select more than k initial centroids and then select among these initial centroids
– Select most widely separated
• Post-processing
• Bisecting K-means
– Not as susceptible to initialization issues
Pre-processing and Post-processing
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Merge clusters that are ‘close’ and that have relatively low SSE
• Limitations of K-means
– Clusters with differing densities
– Non-globular shapes
– Sensitive to outliers: an object with an extremely large value may substantially distort the distribution of the data.
• Solution: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
K-medoid Algorithm:
1. Decide on a value for k.
2. Initialize the k cluster centers with k actual objects from the data.
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center (medoid).
4. Re-estimate the k cluster centers by choosing, in each cluster, the most centrally located object (the medoid).
5. If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.
K-medoid Clustering: Steps 1–5
Algorithm: k-medoids; Distance Metric: Euclidean Distance
[Figure sequence: five snapshots of k-medoids on 2-D points (axes 0–5) with medoids k1, k2, k3 chosen from the data objects: initial medoid selection, assignment of points to the nearest medoid, re-estimation of the medoids, reassignment, and the final converged clustering.]
The K-Medoids Clustering Method
– PAM (Partitioning Around Medoids) starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering.
– PAM works effectively for small data sets, but does not scale well for large data sets.
PAM algorithm:
1. Select k representative objects (medoids) arbitrarily.
2. For each pair of selected object (i) and non-selected object (h), calculate the total swapping cost TC_ih; if TC_ih < 0, replace i by h.
3. Then assign each non-selected object to the most similar representative object.
4. Repeat steps 2–3 until there is no change.
[Figure: illustration of the swap-cost computation for two clusters C1 and C2, with objects x1, x2, x3 and y1, y2, y3.]
PAM or K-Medoids: Example

Data Objects
       A1   A2
O1      2    6
O2      3    4
O3      3    8
O4      4    7
O5      6    2
O6      6    4
O7      7    3
O8      7    4
O9      8    5
O10     7    6

Goal: create two clusters.
Choose randomly two medoids: O8 = (7, 4) and O2 = (3, 4).
[Figure: the ten objects plotted in the (A1, A2) plane with the two medoids marked.]
PAM or K-Medoids: Example
Assign each object to the closest representative object.
Using the L1 metric (Manhattan distance), we form the following clusters:
Cluster1 = {O1, O2, O3, O4} around medoid O2; Cluster2 = {O5, O6, O7, O8, O9, O10} around medoid O8.
[Figure: the two clusters in the (A1, A2) plane.]
PAM or K-Medoids: Example
Compute the absolute error criterion for the set of medoids (O2, O8):

E = Σ_{i=1..k} Σ_{p ∈ C_i} |p − O_i|
  = |O1 − O2| + |O3 − O2| + |O4 − O2|
  + |O5 − O8| + |O6 − O8| + |O7 − O8| + |O9 − O8| + |O10 − O8|
PAM or K-Medoids: Example
The absolute error criterion for the set of medoids (O2, O8):
E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
PAM or K-Medoids: Example
• Choose a random object, O7.
• Swap O8 and O7.
• Compute the absolute error criterion for the set of medoids (O2, O7):
E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
PAM or K-Medoids: Example
→ Compute the cost of the swap:
S = Absolute error [O2, O7] − Absolute error [O2, O8] = 22 − 20 = 2
S > 0 ⇒ it is a bad idea to replace O8 by O7.
PAM or K-Medoids: Example
Per-object swap costs C_jih for replacing medoid i = O8 by h = O7:
C_{6,8,7} = d(O7, O6) − d(O8, O6) = 2 − 1 = 1
C_{5,8,7} = d(O7, O5) − d(O8, O5) = 2 − 3 = −1
C_{1,8,7} = 0, C_{3,8,7} = 0, C_{4,8,7} = 0
C_{9,8,7} = d(O7, O9) − d(O8, O9) = 3 − 2 = 1
C_{10,8,7} = d(O7, O10) − d(O8, O10) = 3 − 2 = 1
TC_ih = Σ_j C_jih = 1 − 1 + 1 + 1 = 2 = S
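The arithmetic in this example can be checked with a short Python sketch (my own, not from the slides): it evaluates the absolute-error criterion E under the Manhattan distance for a chosen pair of medoids, so the swap cost is just the difference of two such values.

```python
import numpy as np

# The ten objects O1..O10 from the example (attributes A1, A2).
objects = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
                    [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])

def absolute_error(medoid_indices):
    """E = sum over all objects of the Manhattan distance to their nearest medoid."""
    medoids = objects[medoid_indices]
    dists = np.abs(objects[:, None, :] - medoids[None, :, :]).sum(axis=2)
    return dists.min(axis=1).sum()

E_O2_O8 = absolute_error([1, 7])            # medoids O2, O8 -> 20
E_O2_O7 = absolute_error([1, 6])            # medoids O2, O7 -> 22
print(E_O2_O8, E_O2_O7, E_O2_O7 - E_O2_O8)  # swap cost S = 2 > 0, so keep O8
```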
What Is the Problem with PAM?
• PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
• PAM works efficiently for small data sets but does not scale well for large data sets.

CLARA (Clustering Large Applications)
• CLARA draws a sample of the dataset and applies PAM on the sample in order to find the medoids of the sample.
• It draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output.
• If the sample is a good representative of the entire dataset, then the medoids of the sample should approximate the medoids of the entire dataset.
• Weakness: the algorithm cannot find the best solution if one of the best k medoids is not among the selected sample.
• Do this l times; after the l runs the medoids found are a local optimum.
• Perform the randomized operation m times and return the best local optimum as the final result.
CLARANS Properties
• Advantages
– Experiments show that CLARANS is more effective than both PAM and CLARA
– Handles outliers
• Disadvantages
– The computational complexity of CLARANS is O(n²), where n is the number of objects
Example distance matrix over five items (the items were shown as pictures in the original slide):
0  8  8  7  7
   0  2  4  4
      0  3  3
         0  1
            0
e.g., D(item1, item2) = 8 and D(item4, item5) = 1.
Bottom-Up (agglomerative):
Starting with each item in its own
cluster, find the best pair to merge into
a new cluster. Repeat until all clusters
are fused together.
At each step, consider all possible merges and choose the best one; repeat.
But how do we compute distances between clusters rather than between objects?
Computing distance between clusters: Single Link
• Cluster distance = distance of the two closest members, one in each cluster.
• − Potentially long and skinny clusters.
Example: single link

      1    2    3    4    5
1     0
2     2    0
3     6    3    0
4    10    9    7    0
5     9    8    5    4    0

[Figure: dendrogram built step by step over items 1–5.]
Example: single link (after merging {1, 2})

         (1,2)   3    4    5
(1,2)     0
3         3      0
4         9      7    0
5         8      5    4    0

d_(1,2),3 = min{d_1,3, d_2,3} = min{6, 3} = 3
d_(1,2),4 = min{d_1,4, d_2,4} = min{10, 9} = 9
d_(1,2),5 = min{d_1,5, d_2,5} = min{9, 8} = 8
Example: single link (after merging {1, 2} with 3)

           (1,2,3)   4    5
(1,2,3)     0
4           7        0
5           5        4    0

d_(1,2,3),4 = min{d_(1,2),4, d_3,4} = min{9, 7} = 7
d_(1,2,3),5 = min{d_(1,2),5, d_3,5} = min{8, 5} = 5
Example: single link (final merge)

d_(1,2,3),(4,5) = min{d_(1,2,3),4, d_(1,2,3),5} = 5

[Figure: the complete dendrogram: {1, 2} joined at height 2, item 3 at 3, {4, 5} at 4, and the final merge at 5.]
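The same merge sequence can be reproduced with SciPy's hierarchical clustering, assuming SciPy is available; the sketch feeds it the distance matrix from this example in condensed form.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# The 5x5 distance matrix from the single-link example.
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  3,  9,  8],
              [ 6,  3,  0,  7,  5],
              [10,  9,  7,  0,  4],
              [ 9,  8,  5,  4,  0]], dtype=float)

# linkage() expects the condensed (upper-triangular) form of the distance matrix.
Z = linkage(squareform(D), method='single')
print(Z)  # merges: {1,2} at 2, +3 at 3, {4,5} at 4, everything at 5 (0-based labels)
```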
Computing distance between clusters: Complete Link
• Cluster distance = distance of the two farthest members, one in each cluster.
• + Tight clusters.
Computing distance between clusters: Average Link
• Cluster distance = average distance of all pairs, one member from each cluster.
[Figure: dendrogram over about 30 items; the height at which two clusters are joined represents the distance between them.]
RSS(P, Q) = Σ_{i ∈ P} ‖x_i − x̄_P‖² + Σ_{j ∈ Q} ‖x_j − x̄_Q‖²

Note:
• Each step reduces RSS(P, Q).
• No guarantee to find the optimal partition.
Fuzzy Set
– A core point has at least MinPts points within distance Eps.
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
– A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
DBSCAN Algorithm (simplified view for teaching)
1. Create a graph whose nodes are the points to be clustered.
2. For each core point c create an edge from c to every point p in the Eps-neighborhood of c.
3. Set N to the nodes of the graph.
4. If N does not contain any core points, terminate.
5. Pick a core point c in N.
6. Let X be the set of nodes that can be reached from c by going forward along edges;
   1. create a cluster containing X ∪ {c};
   2. N = N \ (X ∪ {c}).
7. Continue with step 4.
Remark: points that are not assigned to any cluster are outliers.
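A compact Python sketch of this graph-based view (my own simplification of the steps above, not the reference implementation): points with at least MinPts neighbors within Eps are core points, clusters are grown outward from core points, and everything left unassigned is noise.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Simplified DBSCAN; returns one label per point, with -1 marking noise."""
    n = len(X)
    # Step 2: the Eps-neighborhood of every point (a point is its own neighbor).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Steps 5-6: grow a cluster from core point i by following edges forward.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if core[q]:          # only core points keep expanding the cluster
                        stack.append(q)
        cluster_id += 1
    return labels
```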
DBSCAN: Core, Border and Noise Points
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
[Figure: the original points and two DBSCAN results, one with (MinPts=4, Eps=9.75) and one with (MinPts=4, Eps=9.92).]
• Varying densities
• High-dimensional data
Complexity of DBSCAN
• Time complexity: O(n²), since for each point it has to be determined whether it is a core point; this can be reduced to O(n log n) in lower-dimensional spaces by using efficient data structures (n is the number of objects to be clustered).
• Space complexity: O(n).
Summary: DBSCAN
• Good: can detect clusters of arbitrary shape, is not very sensitive to noise, supports outlier detection, and its complexity is acceptable; besides K-means it is the second most used clustering algorithm.
• Bad: does not work well on high-dimensional datasets, parameter selection is tricky, it has problems identifying clusters of varying densities (→ SNN algorithm), and its density estimation is simplistic (→ it does not create a real density function, but rather a graph of density-connected points).
DBSCAN Algorithm Revisited
• Eliminate noise points
• Perform clustering on the remaining points:
DENCLUE
• Example: Gaussian influence function and density

f_Gaussian(x, y) = exp( −d(x, y)² / (2σ²) )

f^D_Gaussian(x) = Σ_{i=1..N} exp( −d(x, x_i)² / (2σ²) )

∇f^D_Gaussian(x) = Σ_{i=1..N} (x_i − x) · exp( −d(x, x_i)² / (2σ²) )
Example: Density Computation
D = {x1, x2, x3, x4}
[Figure: the four data points x1–x4 and two query points x and y with their influence values.]
Remark: the density value of y would be larger than the one for x.
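A minimal sketch of the Gaussian influence, density, and gradient functions above; the data set and σ in the usage lines are made-up values, not the ones from the figure.

```python
import numpy as np

def gaussian_influence(x, y, sigma):
    """Influence of point y on point x: exp(-d(x, y)^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def gaussian_density(x, data, sigma):
    """DENCLUE density at x: sum of the influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

def gaussian_gradient(x, data, sigma):
    """Gradient of the density at x, used to hill-climb to a density attractor."""
    return sum((xi - x) * gaussian_influence(x, xi, sigma) for xi in data)

# Hypothetical usage: the density is higher near the tight group of points.
D = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0]])
print(gaussian_density(np.array([1.0, 1.0]), D, sigma=0.5))  # relatively large
print(gaussian_density(np.array([3.0, 3.0]), D, sigma=0.5))  # close to zero
```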
Density Attractor
Examples of DENCLUE Clusters
Basic Steps of the DENCLUE Algorithm
Confusion matrix

                               Predicted condition
Total population = P + N       Positive (PP)          Negative (PN)
Actual      Positive (P)       True positive (TP)     False negative (FN)
condition   Negative (N)       False positive (FP)    True negative (TN)
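The usual evaluation measures follow directly from these four counts; a small sketch with standard definitions (the example counts are made up):

```python
def classification_measures(tp, fp, fn, tn):
    """Common measures derived from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted positives, how many are correct
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example: 40 TP, 10 FP, 5 FN, 45 TN.
print(classification_measures(40, 10, 5, 45))
```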
Measures
➡Classification
➡Clustering
Supervised Training/Learning
– a “teacher” provides labeled training sets, used
to train a classifier
Classifier
A classifier partitions the sample space X into class-labeled regions such that
X = X1 ∪ X2 ∪ … ∪ X|Y| and Xi ∩ Xj = {} for i ≠ j.
[Figure: a sample space partitioned into decision regions labeled X1, X2, X3.]
The classification consists of determining to which region a feature vector x belongs.
Borders between decision regions are called decision boundaries.
Components of a classifier system
Task
– To design a classifier (decision rule) q : X → Y which decides about a hidden state based on an observation.

Training Samples
The set of hidden states is Y = {A, I}.
The feature space is X = R²; an observation is a vector x = (x1, x2) (e.g., x2 = weight).
Training examples: {(x1, y1), …, (xl, yl)}.

A linear decision rule:
q(x) = A if (w · x) + b ≥ 0
       I if (w · x) + b < 0
The decision boundary is the line (w · x) + b = 0.
[Figure: the training samples plotted in the (x1, x2) plane, separated by the line (w · x) + b = 0.]
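A minimal sketch of the linear decision rule q above; the weight vector, bias, and test points are made-up values for illustration.

```python
import numpy as np

def q(x, w, b):
    """Linear decision rule: class A if (w . x) + b >= 0, otherwise class I."""
    return 'A' if np.dot(w, x) + b >= 0 else 'I'

# Hypothetical parameters: the boundary (w . x) + b = 0 is a line in the feature space.
w, b = np.array([1.0, -0.5]), -2.0
print(q(np.array([4.0, 1.0]), w, b))  # 'A'  (1.5 >= 0)
print(q(np.array([1.0, 3.0]), w, b))  # 'I'  (-2.5 < 0)
```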
Minimum Distance Classifier
• Each class ω_j is represented by its mean vector m_j.
• Training is performed by calculating the mean of the training patterns of each class:
  m_j = (1 / N_j) Σ_{X ∈ ω_j} X,   j = 1, 2, …, M
• D_j(X) = ‖X − m_j‖ (Euclidean distance)
• Assign X to class ω_i if D_i(X) < D_j(X) for all j = 1, 2, …, M, j ≠ i.
[Figure: two class means and a test pattern assigned to the nearer class.]
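A minimal sketch of the minimum distance classifier: training just averages the patterns of each class, and prediction picks the class whose mean is nearest (the toy data are made up).

```python
import numpy as np

def train_min_distance(X, y):
    """Compute the mean vector m_j of each class j."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_min_distance(x, means):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

# Hypothetical usage with two classes of 2-D patterns.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array([0, 0, 1, 1])
means = train_min_distance(X, y)
print(predict_min_distance(np.array([4.5, 4.0]), means))  # 1
```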
Introduction
The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification and regression predictive problems.
1. Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase or model; it uses all the data for training while classifying.
3. Eager learning algorithm − Eager learners, when given a set of training tuples, construct a generalization model before receiving new (e.g., test) tuples to classify.
Lazy learners
• When an instance arrives for testing, the algorithm is run to get the class prediction.
Instance-based Learning
K-NN classifier schematic

Alternatives for the distance measure:
1. Boolean distance (1 if same, 0 if different)
2. Differential grading (e.g., weather: 'drizzling' and 'rainy' are closer than 'rainy' and 'sunny')

Choosing K: determine K experimentally; the K that gives minimum error is selected.
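A minimal K-NN classification sketch tying the pieces together: a distance function plus a majority vote among the K nearest training examples (the data and K=3 are made-up illustration values; in practice K would be chosen experimentally as described above).

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every example
    nearest = np.argsort(dists)[:k]               # indices of the k closest examples
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical usage with two labeled groups of 2-D points.
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_predict(np.array([2, 2]), X_train, y_train, k=3))  # 'A'
```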