A Problem of K-means
Sensitive to outliers
Outlier: objects with extremely large (or small) values
May substantially distort the distribution of the data
Limitations of K-means: Outlier Problem
• Since an object with an extremely large value may substantially distort the distribution of the data, the mean is a poor cluster representative.
• Solution: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
(Figure: two scatter plots on 0-10 axes showing how an outlier distorts the mean-based cluster center but not the medoid.)
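The distortion can be seen numerically. In this one-dimensional sketch (the data values are invented for illustration), a single extreme value drags the mean far from the bulk of the data, while the medoid stays on an actual central object:

```python
# Illustrative sketch (not from the slides): an outlier pulls the mean
# away from the data, while the medoid remains a real, central object.
def mean(points):
    return sum(points) / len(points)

def medoid(points):
    # The medoid minimizes the total distance to all other points.
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

data = [1, 2, 3, 4, 100]   # 100 is an outlier
print(mean(data))          # 22.0 -- pulled toward the outlier
print(medoid(data))        # 3    -- still a central, actual object
```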
k-Medoids Clustering Method
k-medoids: Find k representative objects, called medoids
PAM (Partitioning Around Medoids, 1987)
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
(Figure: side-by-side scatter plots on 0-10 axes comparing k-means and k-medoids cluster representatives.)
MEDOID
A medoid can be defined as the point in the cluster whose total similarity with all the other points in the cluster is maximum (equivalently, whose total dissimilarity is minimum).
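As a sketch of this definition (function names and sample points are my own, not from the slides), the medoid of a cluster is found by minimizing the total distance to all other members:

```python
# Sketch: the medoid minimizes total (here Manhattan) distance to the
# other points in the cluster; it is always one of the actual data points.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def medoid(cluster):
    return min(cluster, key=lambda p: sum(manhattan(p, q) for q in cluster))

cluster = [(1, 1), (2, 2), (3, 3), (10, 1)]   # illustrative points
print(medoid(cluster))                        # (2, 2)
```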
A Typical K-Medoids Algorithm (PAM)
K = 2; total cost with the initial medoids {mi} = 20.
1. Arbitrarily choose k objects as the initial medoids (mi, i = 1..k).
2. Assign each remaining object to the nearest medoid.
3. Randomly select a non-medoid object Oj.
4. Compute the total cost of swapping mt and Oj, i.e., the cost of the medoid set (Oj, {mi}, i ≠ t). In the illustrated run, the swap lowers the total cost from 20 to 18.
5. If the total cost decreases, perform the swap; repeat steps 3-5 (do loop) until no change.
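The loop above can be sketched in Python. This is a minimal greedy PAM under my own helper names (not the slides' code), run on the ten-object data set used in the worked example:

```python
# Minimal PAM sketch: start from arbitrary medoids, then greedily accept
# any medoid/non-medoid swap that lowers the total cost, until no change.
import itertools
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # Sum of distances from every object to its nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # arbitrary initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                            # do loop until no change
        improved = False
        for t, oj in itertools.product(range(k), points):
            if oj in medoids:
                continue                       # only non-medoid objects Oj
            trial = medoids[:t] + [oj] + medoids[t + 1:]   # swap m_t and Oj
            cost = total_cost(points, trial)
            if cost < best:                    # accept cost-reducing swaps
                medoids, best, improved = trial, cost, True
    return medoids, best

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
meds, cost = pam(points, k=2)
print(sorted(meds), cost)
```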
The k-Medoids Algorithm
Total swapping cost: TCih = Σj Cjih
– where Cjih is the cost of swapping i with h for non-medoid object j.
(Figure: two clusters C1 and C2 with objects x1, x2, x3 and y1, y2, y3.)
K-MEDOIDS ALGORITHM: EXAMPLE-1
1. Randomly select the k medoids.
2. Allocate each point to its closest medoid.
3. Determine the new medoid of each cluster.
4. Re-allocate each point to its closest medoid.
5. Repeat until the medoids no longer change, then stop the process.
Case (i): Computation of Cjih
(Figure: object j with medoids i, t and candidate h on a 0-10 grid.)
Object j currently belongs to the cluster of medoid i; after swapping i with h, j is closest to h. Therefore, in future, j belongs to the cluster represented by h:
Cjih = d(j, h) - d(j, i)
Absolute-error criterion: E = Σ_{i=1..k} Σ_{j ∈ Ci} d(j, i)
Case (ii): Computation of Cjih
(Figure: object j with medoids i, t and candidate h on a 0-10 grid.)
Object j currently belongs to the cluster of medoid i; after swapping i with h, j is closest to another medoid t. Therefore, in future, j belongs to the cluster represented by t:
Cjih = d(j, t) - d(j, i)
Case (iii): Computation of Cjih
(Figure: object j with medoids i, t and candidate h on a 0-10 grid.)
Object j currently belongs to the cluster of medoid t (not i), and after swapping i with h, j is still closest to t. Therefore, in future, j belongs to the cluster represented by t itself:
Cjih = 0
Case (iv): Computation of Cjih
(Figure: object j with medoids i, t and candidate h on a 0-10 grid.)
Object j currently belongs to the cluster of medoid t (not i), but after swapping i with h, j is closest to h. Therefore, in future, j belongs to the cluster represented by h:
Cjih = d(j, h) - d(j, t)
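All four cases amount to one computation: summed over all objects, Cjih is the change in the absolute error E when medoid i is replaced by h. A hedged sketch (helper names are mine), checked against the worked example below, where replacing O8 by O7 costs S = 2:

```python
# Sketch: the net effect of the four cases is the change in total
# absolute error E when medoid i is replaced by candidate h.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_error(points, medoids):
    # E = sum over all objects of the distance to the nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def total_swap_cost(points, medoids, i, h):
    after = [h if m == i else m for m in medoids]
    return total_error(points, after) - total_error(points, medoids)

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
# Replace medoid O8=(7,4) by O7=(7,3), keeping O2=(3,4):
print(total_swap_cost(points, [(3, 4), (7, 4)], (7, 4), (7, 3)))  # 2
```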
PAM or K-Medoids: Example
Data Objects
Object  A1  A2
O1      2   6
O2      3   4
O3      3   8
O4      4   7
O5      6   2
O6      6   4
O7      7   3
O8      7   4
O9      8   5
O10     7   6
(Figure: the ten objects plotted on the A1-A2 plane.)
Goal: create two clusters.
Choose randomly two medoids: O8 = (7,4) and O2 = (3,4).
PAM or K-Medoids: Example
Assign each object to the closest representative object. Using the L1 metric (Manhattan distance), we form the following clusters:
Cluster1 = {O1, O2, O3, O4}
Cluster2 = {O5, O6, O7, O8, O9, O10}
Compute the absolute-error criterion for the set of medoids (O2, O8):
E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − Oi|
  = (|O1 − O2| + |O3 − O2| + |O4 − O2|) + (|O5 − O8| + |O6 − O8| + |O7 − O8| + |O9 − O8| + |O10 − O8|)
The absolute-error criterion for the set of medoids (O2, O8):
E = (3+4+4) + (3+1+1+2+2) = 20
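As a quick check of this arithmetic (a sketch; the dictionary layout is my own), E can be computed directly from the coordinates with the Manhattan metric:

```python
# With medoids O2=(3,4) and O8=(7,4), the Manhattan-distance absolute
# error over all ten objects should come out to 20, as on the slide.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

objects = {"O1": (2, 6), "O2": (3, 4), "O3": (3, 8), "O4": (4, 7),
           "O5": (6, 2), "O6": (6, 4), "O7": (7, 3), "O8": (7, 4),
           "O9": (8, 5), "O10": (7, 6)}
medoids = [objects["O2"], objects["O8"]]
E = sum(min(manhattan(p, m) for m in medoids) for p in objects.values())
print(E)  # 20
```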
PAM or K-Medoids: Example
• Choose a random non-medoid object, O7.
• Swap O8 and O7.
• Compute the absolute-error criterion for the set of medoids (O2, O7):
E = (3+4+4) + (2+2+1+3+3) = 22
PAM or K-Medoids: Example
→ Compute the cost function:
S = Absolute error [O2, O7] − Absolute error [O2, O8]
S = 22 − 20 = 2
S > 0 => it is a bad idea to replace O8 by O7.
PAM or K-Medoids: Example
C187 = 0, C387 = 0, C487 = 0
C587 = d(O7,O5) − d(O8,O5) = 2 − 3 = −1
C687 = d(O7,O6) − d(O8,O6) = 2 − 1 = 1
C987 = d(O7,O9) − d(O8,O9) = 3 − 2 = 1
C1087 = d(O7,O10) − d(O8,O10) = 3 − 2 = 1
TCih = Σj Cjih = 1 − 1 + 1 + 1 = 2 = S
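These per-object costs can be reproduced mechanically. In this sketch (my own naming), Cjih is the change in object j's nearest-medoid distance when medoid i is replaced by candidate h:

```python
# Per-object swapping cost C_jih: j's distance to its nearest medoid
# after replacing i with h, minus its distance before the swap.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def c_jih(j, medoids, i, h):
    before = min(manhattan(j, m) for m in medoids)
    after = min(manhattan(j, m) for m in (medoids - {i}) | {h})
    return after - before

medoids = {(3, 4), (7, 4)}     # O2 and O8
i, h = (7, 4), (7, 3)          # replace O8 with O7
for name, j in [("O5", (6, 2)), ("O6", (6, 4)),
                ("O9", (8, 5)), ("O10", (7, 6))]:
    print(name, c_jih(j, medoids, i, h))   # -1, 1, 1, 1 as on the slide
```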
CLARA (Clustering Large Applications) (1990)
CLARA (Clustering LARge Applications; Kaufmann and Rousseeuw) is a sampling-based method:
• it draws multiple samples of the data set,
• applies PAM to each sample, and
• returns the best clustering (the one minimizing the cost/SSE) as the output.
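The sampling idea can be sketched as follows (a hedged illustration with my own naming; sample counts and sizes are arbitrary, not CLARA's published defaults): run PAM on each random sample, then score each sample's medoids on the full data set and keep the cheapest set.

```python
# CLARA sketch: PAM on several random samples; medoids are scored on the
# FULL data set, and the cheapest medoid set wins.
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k, rng):
    # Minimal greedy PAM: swap medoids while any swap lowers the cost.
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for t in range(k):
            for oj in points:
                if oj in medoids:
                    continue
                trial = medoids[:t] + [oj] + medoids[t + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, sample_size=8, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)   # score on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
meds, cost = clara(points, k=2)
print(meds, cost)
```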