Partitioning Algorithms: Basic Concepts: Partition N Objects Into K Clusters
Minimize the total square error over all k clusters:

$E = \sum_{i=1}^{k} \sum_{p \in C_i} |d(p, m_i)|^2$

where $C_i$ is the i-th cluster, $m_i$ is its center, and $d$ is the distance function.
Example of Square Error of a Cluster
Ci = {P1, P2, P3}, with P1 = (3, 7), P2 = (2, 3), P3 = (7, 5); center mi = (4, 5).
|d(P1, mi)|^2 = (3-4)^2 + (7-5)^2 = 5
|d(P2, mi)|^2 = (2-4)^2 + (3-5)^2 = 8
|d(P3, mi)|^2 = (7-4)^2 + (5-5)^2 = 9
Error(Ci) = 5 + 8 + 9 = 22
[Figure: P1, P2, P3 and the cluster center mi plotted on a 10x10 grid.]
Example of Square Error of a Cluster
Cj = {P4, P5, P6}, with P4 = (4, 6), P5 = (5, 5), P6 = (3, 4); center mj = (4, 5).
|d(P4, mj)|^2 = (4-4)^2 + (6-5)^2 = 1
|d(P5, mj)|^2 = (5-4)^2 + (5-5)^2 = 1
|d(P6, mj)|^2 = (3-4)^2 + (4-5)^2 = 2
Error(Cj) = 1 + 1 + 2 = 4
[Figure: P4, P5, P6 and the cluster center mj plotted on a 10x10 grid.]
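Both computations can be checked in a few lines of Python; this is a minimal sketch, and the function name sq_error is illustrative, not from the slides.

def sq_error(points, center):
    # Sum of squared Euclidean distances from each point to the center.
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

Ci = [(3, 7), (2, 3), (7, 5)]      # P1, P2, P3
Cj = [(4, 6), (5, 5), (3, 4)]      # P4, P5, P6
print(sq_error(Ci, (4, 5)))        # 22
print(sq_error(Cj, (4, 5)))        # 4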
Partitioning Algorithms: Basic Concepts
Global optimum: examine all possible partitions; there are on the order of k^n of them, too expensive!
Heuristic methods: k-means and k-medoids
k-means (MacQueen '67): each cluster is represented by the center (mean) of the cluster
k-medoids (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects (the medoid) in the cluster
K-means
Initialization
    Arbitrarily choose k objects as the initial cluster centers (centroids)
Iterate until no change
    For each object Oi
        Calculate the distances between Oi and the k centroids
        (Re)assign Oi to the cluster whose centroid is closest to Oi
    Update the cluster centroids based on the current assignment
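A minimal Python sketch of this loop, assuming objects are tuples of numbers and using squared Euclidean distance; all names here are illustrative.

import random

def kmeans(objects, k, max_iter=100):
    centroids = random.sample(objects, k)       # arbitrary initial centers
    assignment = [None] * len(objects)
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the closest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(o, centroids[c])))
            for o in objects
        ]
        if new_assignment == assignment:        # no change: stop iterating
            break
        assignment = new_assignment
        # Update each centroid to the mean of its current members.
        for c in range(k):
            members = [o for o, a in zip(objects, assignment) if a == c]
            if members:                         # guard against empty clusters
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids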
The k-Means Clustering Method
[Figure: four panels of a k-means run on a 10x10 grid; objects are assigned to the current cluster means, the means are then relocated, and objects are reassigned to form new clusters, until assignments stabilize.]
Example
For simplicity, one-dimensional objects and k = 2. Objects: 1, 2, 5, 6, 7.
K-means:
Randomly select 5 and 6 as the initial centroids
=> two clusters {1, 2, 5} and {6, 7}; meanC1 = 8/3, meanC2 = 6.5
=> {1, 2} and {5, 6, 7}; meanC1 = 1.5, meanC2 = 6
=> no change.
Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 1^2 = 2.5
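Running the same steps deterministically (initial centroids fixed at 5 and 6 rather than chosen at random) reproduces this trace; the sketch below is self-contained and one-dimensional.

objects = [1, 2, 5, 6, 7]
centroids = [5.0, 6.0]
for _ in range(10):                 # more than enough iterations to converge
    clusters = [[], []]
    for o in objects:
        nearest = min((0, 1), key=lambda i: (o - centroids[i]) ** 2)
        clusters[nearest].append(o)
    centroids = [sum(c) / len(c) for c in clusters]
print(centroids)                    # [1.5, 6.0]
print(sum(min((o - c) ** 2 for c in centroids) for o in objects))   # 2.5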
Variations of the k-Means Method
Aspects in which variants of k-means differ:
    Selection of the initial k centroids
        E.g., choose the k farthest points
    Dissimilarity calculations
        E.g., use Manhattan distance
    Strategies to calculate cluster means
        E.g., update the means incrementally
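Two of these variations sketched in Python; the function names are illustrative. Manhattan distance simply replaces the dissimilarity, and greedy farthest-point seeding is one way to choose the initial k centroids.

def manhattan(p, q):
    # L1 (Manhattan) dissimilarity between two tuples.
    return sum(abs(x - y) for x, y in zip(p, q))

def farthest_first(objects, k, dist=manhattan):
    # Greedy "choose k farthest points" seeding: start from one object,
    # then repeatedly add the object farthest from the centers so far.
    centers = [objects[0]]
    while len(centers) < k:
        centers.append(max(objects,
                           key=lambda o: min(dist(o, c) for c in centers)))
    return centers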
Strengths of the k-Means Method
Strength
    Relatively efficient for large data sets: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
Weaknesses of the k-Means Method
Weakness
    Applicable only when the mean is defined; then what about categorical data?
        -> k-modes algorithm
    Unable to handle noisy data and outliers
        -> k-medoids algorithm
    Need to specify k, the number of clusters, in advance
        -> hierarchical algorithms, density-based algorithms
k-modes Algorithm
Handling categorical data: k-modes (Huang '98)
    Replaces the means of clusters with modes
    Given n records in a cluster, the mode is the record made up of the most frequent attribute values
    Uses new dissimilarity measures to deal with categorical objects

Example cluster:

    age      income   student   credit_rating
    <=30     high     no        fair
    <=30     high     no        excellent
    31...40  high     no        fair
    >40      medium   no        fair
    >40      low      yes       fair
    >40      low      yes       excellent
    31...40  low      yes       excellent
    <=30     medium   no        fair
    <=30     low      yes       fair
    >40      medium   yes       fair
    <=30     medium   yes       excellent
    31...40  medium   no        excellent
    31...40  high     yes       fair

In this example cluster, mode = (<=30, medium, yes, fair).
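The mode computation is a per-attribute majority vote, as this sketch shows for the table above (Counter.most_common picks the most frequent value in each column).

from collections import Counter

records = [  # (age, income, student, credit_rating)
    ("<=30", "high", "no", "fair"),
    ("<=30", "high", "no", "excellent"),
    ("31...40", "high", "no", "fair"),
    (">40", "medium", "no", "fair"),
    (">40", "low", "yes", "fair"),
    (">40", "low", "yes", "excellent"),
    ("31...40", "low", "yes", "excellent"),
    ("<=30", "medium", "no", "fair"),
    ("<=30", "low", "yes", "fair"),
    (">40", "medium", "yes", "fair"),
    ("<=30", "medium", "yes", "excellent"),
    ("31...40", "medium", "no", "excellent"),
    ("31...40", "high", "yes", "fair"),
]
mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))
print(mode)   # ('<=30', 'medium', 'yes', 'fair')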
A Problem of k-Means
Sensitive to outliers
    Outlier: an object with extremely large (or small) values
    May substantially distort the distribution of the data
[Figure: an outlier drags the cluster mean (+) away from the bulk of the cluster.]
k-Medoids Clustering Method
k-medoids: Find k representative objects, called medoids
PAM (Partitioning Around Medoids, 1987)
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
[Figure: the same data set clustered by k-means, where centers are means, and by k-medoids, where centers are actual objects.]
PAM (Partitioning Around Medoids) (1987)
Minimizes the square error $E = \sum_{i=1}^{k} \sum_{p \in C_i} |d(p, m_i)|^2$, where each representative $m_i$ is an actual object (medoid) of cluster $C_i$.
For each candidate swap of a medoid m with a non-medoid object h, compute Eh - Em
    Negative: the swap brings benefit
Choose the swap with the minimum swapping cost
Four Swapping Cases
When a medoid m is to be swapped with a non-medoid object h, check each of the other non-medoid objects j:
j is in the cluster of m -> reassign j
    Case 1: j is closer to some other medoid k than to h; after swapping m and h, j relocates to the cluster represented by k
    Case 2: j is closer to h than to any other medoid k; after swapping m and h, j is in the cluster represented by h
j is in the cluster of some medoid k, not m -> compare k with h
    Case 3: j is closer to k than to h; after swapping m and h, j stays in the cluster represented by k
    Case 4: j is closer to h than to k; after swapping m and h, j relocates to the cluster represented by h
PAM Clustering: Total Swapping Cost
$TC_{mh} = \sum_j C_{jmh}$, where $C_{jmh}$ is the change in j's contribution to the square error when m is swapped with h.
[Figure: four panels on a 10x10 grid illustrating Cases 1-4, with an object j, current medoids m and k, and the candidate medoid h.]
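A sketch of the swap-cost computation in Python. Taking the minimum distance over the new medoid set covers all four cases at once; d is an assumed distance function and all names are illustrative.

def swap_cost(objects, medoids, m, h, d):
    # TC_mh = sum over all other non-medoid objects j of C_jmh, the change
    # in j's distance to its closest medoid when m is replaced by h.
    new_medoids = [x for x in medoids if x != m] + [h]
    tc = 0.0
    for j in objects:
        if j in medoids or j == h:
            continue
        before = min(d(j, o) for o in medoids)        # current assignment
        after = min(d(j, o) for o in new_medoids)     # assignment after swap
        tc += after - before                          # C_jmh
    return tc

PAM performs the (m, h) swap with the most negative TC_mh and repeats until no swap has a negative cost.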
Strength and Weakness of PAM
Strength: more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by extreme values than a mean
Weakness: works well for small data sets but does not scale; each iteration examines all k(n-k) candidate swaps
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw, 1990)
    Built into statistical analysis packages, such as S+
    Draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as output
    Handles larger data sets than PAM (e.g., 1,000 objects in 10 clusters)
    Efficiency and effectiveness depend on the sampling
CLARA - Algorithm
Set mincost to MAXIMUM;
Repeat q times                  // draw q samples
    Create S by drawing s objects randomly from D;
    Generate the set of medoids M from S by applying the PAM algorithm;
    Compute cost(M, D);
    If cost(M, D) < mincost
        mincost = cost(M, D);
        bestset = M;
    Endif;
Endrepeat;
Return bestset;
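A direct Python transcription of this pseudocode. pam(S, k) and cost(M, D) are assumed helpers: the PAM procedure applied to the sample, and the total dissimilarity of the whole data set under medoid set M.

import random

def clara(D, k, q, s, pam, cost):
    mincost, bestset = float("inf"), None
    for _ in range(q):                  # draw q samples
        S = random.sample(D, s)
        M = pam(S, k)                   # run PAM on the sample only...
        c = cost(M, D)                  # ...but evaluate on the whole data set
        if c < mincost:
            mincost, bestset = c, M
    return bestset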
Complexity of CLARA
Set mincost to MAXIMUM;                               O(1)
Repeat q times                                        O((s-k)^2 * k + (n-k) * k) per pass
    Create S by drawing s objects randomly from D;    O(1)
    Generate the set of medoids M from S by
        applying the PAM algorithm;                   O((s-k)^2 * k)
    Compute cost(M, D);                               O((n-k) * k)
    If cost(M, D) < mincost                           O(1)
        mincost = cost(M, D);
        bestset = M;
    Endif;
Endrepeat;
Return bestset;

Total: O(q * ((s-k)^2 * k + (n-k) * k)).
Strengths and Weaknesses of CLARA
Strength:
    Handles larger data sets than PAM (e.g., 1,000 objects in 10 clusters)
Weaknesses:
    Efficiency depends on the sample size
    A good clustering of the samples does not necessarily represent a good clustering of the whole data set if the samples are biased
CLARANS ("Randomized" CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
CLARANS draws samples in the solution space dynamically
    A solution is a set of k medoids
    The solution space contains $\binom{n}{k}$ solutions in total
    The solution space can be represented by a graph where every node is a potential solution, i.e., a set of k medoids
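The combinatorial size of this space is easy to check; for instance, for n = 1,000 objects and k = 10 medoids:

import math
print(math.comb(1000, 10))   # ~2.63e23 candidate solutions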
Graph Abstraction
Every node is a potential solution (a set of k medoids)
Every node is associated with a squared error
Two nodes are adjacent if they differ by exactly one medoid
Every node has k(n-k) adjacent nodes
[Figure: the node {O1, O2, ..., Ok} and its k(n-k) neighbors, such as {O(k+1), O2, ..., Ok}, ..., {On, O2, ..., Ok}, each obtained by swapping one medoid for one non-medoid object.]
CLARANS
[Figure: CLARANS search. Starting from a random current node C, compare C with at most maxneighbor randomly chosen neighbors N; whenever a neighbor has lower cost, move to it and restart the comparison count. If no better neighbor is found within maxneighbor comparisons, C is a local minimum. Repeat from numlocal random starting nodes and return the best node found.]
CLARANS - Algorithm
Set mincost to MAXIMUM;
For i = 1 to numlocal do            // find numlocal local minima
    Randomly select a node as the current node C in the graph;
    J = 1;                          // counter of neighbors examined
    Repeat
        Randomly select a neighbor N of C;
        If Cost(N, D) < Cost(C, D)
            Assign N as the current node C;
            J = 1;
        Else
            J++;
        Endif;
    Until J > maxneighbor;
    Update mincost and bestnode with Cost(C, D) and C if Cost(C, D) < mincost;
End For;
Return bestnode;
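A Python sketch of this search. cost(M, D) is an assumed helper returning the total dissimilarity of D under medoid set M; the other names follow the pseudocode.

import random

def clarans(D, k, numlocal, maxneighbor, cost):
    bestnode, mincost = None, float("inf")
    for _ in range(numlocal):               # find numlocal local minima
        C = random.sample(D, k)             # a random node: a set of k medoids
        j = 1
        while j <= maxneighbor:
            # A random neighbor differs from C in exactly one medoid.
            N = list(C)
            N[random.randrange(k)] = random.choice(
                [o for o in D if o not in C])
            if cost(N, D) < cost(C, D):
                C, j = N, 1                 # move to the better neighbor
            else:
                j += 1
        if cost(C, D) < mincost:            # C is now a local minimum
            mincost, bestnode = cost(C, D), C
    return bestnode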
Graph Abstraction (k-means, k-modes, k-medoids)
Each vertex is a set of k representative objects (means, modes, or medoids)
Each iteration produces a new set of k representative objects with lower overall dissimilarity
Iterations correspond to a hill-descent process in a landscape (graph) of vertices
Comparison with PAM
PAM searches for the minimum in the graph (landscape): at each step, all adjacent vertices are examined, and the one with the deepest descent is chosen as the next set of k medoids; the search continues until a minimum is reached.
For large n and k (e.g., n = 1,000, k = 10), examining all k(n-k) adjacent vertices is time consuming, so PAM is inefficient for large data sets.
CLARANS vs. PAM:
    For large and medium data sets, CLARANS is much more efficient than PAM
    Even for small data sets, CLARANS significantly outperforms PAM
When n = 80, CLARANS is 5 times faster than PAM, while the cluster quality is the same.
Comparison with CLARA
CLARANS vs. CLARA:
    CLARANS is always able to find clusterings of better quality than those found by CLARA, though CLARANS may use much more time than CLARA
    When the time used is the same, CLARANS is still better than CLARA
Hierarchies of Co-expressed Genes and Coherent Patterns
The interpretation of co-expressed genes and coherent patterns mainly depends on domain knowledge.
A Subtle Situation
[Figure: group A splits into two subgroups, A1 and A2, at a finer clustering resolution.]