
CS423
DATA WAREHOUSING AND DATA MINING

Lecture 10
UNSUPERVISED LEARNING

Dr. Hammad Afzal

hammad.afzal@mcs.edu.pk

Department of Computer Software Engineering


National University of Sciences and Technology (NUST)
CLUSTERING

Clustering is the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.

Clustering is unsupervised classification.
CLUSTERING
 There is no explicit teacher; the system forms clusters, or “natural groupings”, from the structure in the input patterns.
CLUSTERING
 Data WITHOUT classes or labels:

$x_1, x_2, x_3, \ldots, x_n, \quad x_i \in \mathbb{R}^d$

 Deals with finding a structure in a collection of unlabeled data.

 The process of organizing objects into groups whose members are similar in some way.

 A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
CLUSTERING

 In this case we easily identify the 4 clusters into which the data can be divided.

 The similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance measure.
TYPES OF CLUSTERING
 Hierarchical algorithms
 These find successive clusters using previously established clusters.

1. Agglomerative ("bottom-up"): Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.

2. Divisive ("top-down"): Divisive algorithms begin with the whole set and proceed to divide it successively into smaller clusters.
TYPES OF CLUSTERING
 Partitional clustering
 Constructs a partition of the data set to produce several clusters at once.
 The process is repeated iteratively until a termination condition is met.
 Examples:
 K-means clustering
 Fuzzy c-means clustering
K MEANS CLUSTERING
1. Choose the number (K) of clusters and randomly select the centroids of each cluster.
2. For each data point:
   I. Calculate the distance from the data point to each centroid.
   II. Assign the data point to the closest cluster.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until there is no further change in the assignment of data points (or in the centroids), as sketched below.
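A minimal sketch of these steps in Python with NumPy (the function name and structure are my own, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n, d) data array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: distance from every point to every centroid,
        # then assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```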

K MEANS – EXAMPLE 2
 Suppose we have 4 medicines and each has two attributes (pH and weight index). Our goal is to group these objects into K=2 clusters of medicine.

Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4

[Figure: the four medicines plotted by weight and pH-index, with A and B marked as the initial centroids]
K MEANS – EXAMPLE 2
 Compute the distance between all samples and the K centroids:

$c_1 = A = (1, 1), \quad c_2 = B = (2, 1)$

$d(D, c_1) = \sqrt{(5-1)^2 + (4-1)^2} = 5$
$d(D, c_2) = \sqrt{(5-2)^2 + (4-1)^2} \approx 4.24$
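The same two distances, checked with NumPy (a quick verification, not part of the slides):

```python
import numpy as np

D, c1, c2 = np.array([5, 4]), np.array([1, 1]), np.array([2, 1])
print(np.linalg.norm(D - c1))  # 5.0
print(np.linalg.norm(D - c2))  # 4.2426... ~ 4.24
```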

K MEANS – EXAMPLE 2
 Assign each sample to its closest cluster.
 An element in a row of the Group matrix below is 1 if and only if the object is assigned to that group:

         A  B  C  D
Group 1  1  0  0  0
Group 2  0  1  1  1
K MEANS – EXAMPLE 2
 Re-calculate the K centroids

Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

$c_1 = (1, 1)$
$c_2 = \left(\frac{2+4+5}{3},\ \frac{1+3+4}{3}\right) = (11/3,\ 8/3) \approx (3.67,\ 2.67)$

K MEANS – EXAMPLE 2
 Repeat the above steps

Compute the distance of all objects to the new centroids $c_1 = (1, 1)$ and $c_2 = (3.67, 2.67)$.
K MEANS – EXAMPLE 2

Assign the membership to objects: A and B now fall in cluster 1, C and D in cluster 2.
K MEANS – EXAMPLE 2

Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

$c_1 = \left(\frac{1+2}{2},\ \frac{1+1}{2}\right) = (1.5,\ 1)$
$c_2 = \left(\frac{4+5}{2},\ \frac{3+4}{2}\right) = (4.5,\ 3.5)$
K MEANS – EXAMPLE 2

 We obtain G2 = G1: comparing the grouping of the last iteration and this iteration reveals that the objects no longer move between groups.
 Thus, the computation of the k-means clustering has reached stability and no more iterations are needed.
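Running the `kmeans` sketch from earlier on the medicine data reproduces this result:

```python
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])  # A, B, C, D
labels, centroids = kmeans(X, k=2)
print(labels)     # [0 0 1 1] -> {A, B} and {C, D} (cluster ids may be swapped)
print(centroids)  # [[1.5 1. ]
                  #  [4.5 3.5]]
```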

Hierarchical Clustering
HIERARCHICAL CLUSTERING
 Agglomerative and divisive clustering on the data set {a, b, c, d, e}

[Figure: agglomerative clustering proceeds bottom-up (steps 0–4), merging a and b into ab, d and e into de, then c and de into cde, and finally ab and cde into abcde; divisive clustering runs the same steps top-down (steps 4–0).]
AGGLOMERATIVE CLUSTERING
1. Convert object attributes to a distance matrix.
2. Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning).
3. Repeat until the number of clusters is one (or a known number of clusters), as sketched below:
   a. Merge the two closest clusters.
   b. Update the distance matrix.
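A compact sketch of this loop under single-link merging (function and variable names are mine, assuming a precomputed NumPy distance matrix):

```python
import numpy as np
from itertools import combinations

def agglomerative(dist, n_clusters=1):
    """dist: symmetric (N, N) NumPy distance matrix; returns the final clusters."""
    clusters = [[i] for i in range(len(dist))]   # step 2: one cluster per object
    while len(clusters) > n_clusters:            # step 3
        # Step 3a: find the two closest clusters (single link).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(dist[x, y] for x in clusters[p[0]] for y in clusters[p[1]]),
        )
        # Step 3b: merge them; distances are re-derived from members on the next pass.
        clusters[i] += clusters.pop(j)
    return clusters
```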

[Figure: documents d1–d5 merged step by step into d1,d2; d4,d5; then d3,d4,d5]
STARTING SITUATION
 Start with clusters of individual points and a distance/proximity matrix

[Distance matrix with one row/column per point: p1, p2, p3, p4, p5, ...]
[Figure: individual data points p1–p12]
INTERMEDIATE SITUATION
 After some merging steps, we have some clusters

[Distance matrix with one row/column per cluster: C1–C5]
[Figure: clusters C1–C5 over points p1–p12]
INTERMEDIATE SITUATION
 How do we compare two clusters?

[Figure: clusters C1–C5]
INTER-CLUSTER DISTANCE MEASURES

Similarity?
 Single Link
 Average Link
 Complete Link
 Distance between centroids
INTERMEDIATE SITUATION
 We want to merge the two closest clusters (C2 and C5) and update the distance matrix.

[Distance matrix with one row/column per cluster: C1–C5]
[Figure: clusters C1–C5 over points p1–p12]
SINGLE LINK
 Smallest distance between an element in one cluster and an element in the other

$D(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} D(x, y)$
COMPLETE LINK
 Largest distance between an element in one cluster and an element in the other

$D(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} D(x, y)$
AVERAGE LINK
 Average distance between an element in one cluster and an element in the other

$D(c_i, c_j) = \operatorname{avg}_{x \in c_i,\, y \in c_j} D(x, y)$
DISTANCE BETWEEN CENTROIDS
 Distance between the centroids of two clusters

$D(c_i, c_j) = D(\mu_i, \mu_j)$, where $\mu_i$ and $\mu_j$ are the centroids of $c_i$ and $c_j$
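The four measures as small Python functions over arrays of points (a sketch with my own names, not library code):

```python
import numpy as np

def single_link(ci, cj):    # smallest pairwise distance
    return min(np.linalg.norm(x - y) for x in ci for y in cj)

def complete_link(ci, cj):  # largest pairwise distance
    return max(np.linalg.norm(x - y) for x in ci for y in cj)

def average_link(ci, cj):   # average over all pairs
    return np.mean([np.linalg.norm(x - y) for x in ci for y in cj])

def centroid_link(ci, cj):  # distance between the two centroids
    return np.linalg.norm(np.mean(ci, axis=0) - np.mean(cj, axis=0))
```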
AFTER MERGING
 Update the distance matrix: only the entries involving the merged cluster C2∪C5 need to be recomputed ("?")

Dist   C1  C2∪C5  C3  C4
C1         ?
C2∪C5  ?   ?      ?   ?
C3         ?
C4         ?

[Figure: clusters C1, C2∪C5, C3, C4 over points p1–p12]
AGGLOMERATIVE CLUSTERING - EXAMPLE

   X1   X2
A  1    1
B  1.5  1.5
C  5    5
D  3    4
E  4    4
F  3    3.5

Data matrix

Euclidean distance: $d_{AB} = \sqrt{(1-1.5)^2 + (1-1.5)^2} = 0.707$

Dist  A     B     C     D     E     F
A     0.00  0.71  5.66  3.61  4.24  3.20
B     0.71  0.00  4.95  2.92  3.54  2.50
C     5.66  4.95  0.00  2.24  1.41  2.50
D     3.61  2.92  2.24  0.00  1.00  0.50
E     4.24  3.54  1.41  1.00  0.00  1.12
F     3.20  2.50  2.50  0.50  1.12  0.00
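The whole matrix can be reproduced with SciPy (`pdist`/`squareform` are real SciPy functions; the rounding is mine):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])  # A..F
print(np.round(squareform(pdist(X)), 2))  # matches the table above
```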
Merge two closest clusters

AGGLOMERATIVE CLUSTERING - EXAMPLE

   X1   X2
A  1    1
B  1.5  1.5
C  5    5
D  3    4
E  4    4
F  3    3.5

Data matrix

Dist  A     B     C     D     E     F
A     0.00  0.71  5.66  3.61  4.24  3.20
B     0.71  0.00  4.95  2.92  3.54  2.50
C     5.66  4.95  0.00  2.24  1.41  2.50
D     3.61  2.92  2.24  0.00  1.00  0.50
E     4.24  3.54  1.41  1.00  0.00  1.12
F     3.20  2.50  2.50  0.50  1.12  0.00

Find the two closest clusters (D and F at distance 0.50) and merge them into a single cluster.

Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Dist  A     B     C     D     E     F
A     0.00  0.71  5.66  3.61  4.24  3.20
B     0.71  0.00  4.95  2.92  3.54  2.50
C     5.66  4.95  0.00  2.24  1.41  2.50
D     3.61  2.92  2.24  0.00  1.00  0.50
E     4.24  3.54  1.41  1.00  0.00  1.12
F     3.20  2.50  2.50  0.50  1.12  0.00

Dist  A     B     C     D,F   E
A     0.00  0.71  5.66  ?     4.24
B     0.71  0.00  4.95  ?     3.54
C     5.66  4.95  0.00  ?     1.41
D,F   ?     ?     ?     0.00  ?
E     4.24  3.54  1.41  ?     0.00

Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Min Distance (Single Linkage):
D(D,F)→A = min(dDA, dFA) = min(3.61, 3.20) = 3.20
D(D,F)→B = min(dDB, dFB) = min(2.92, 2.50) = 2.50
D(D,F)→C = min(dDC, dFC) = min(2.24, 2.50) = 2.24
D(D,F)→E = min(dDE, dFE) = min(1.00, 1.12) = 1.00

Dist  A     B     C     D,F   E
A     0.00  0.71  5.66  3.20  4.24
B     0.71  0.00  4.95  2.50  3.54
C     5.66  4.95  0.00  2.24  1.41
D,F   3.20  2.50  2.24  0.00  1.00
E     4.24  3.54  1.41  1.00  0.00

Merge two closest clusters

AGGLOMERATIVE CLUSTERING - EXAMPLE

Dist  A     B     C     D,F   E
A     0.00  0.71  5.66  3.20  4.24
B     0.71  0.00  4.95  2.50  3.54
C     5.66  4.95  0.00  2.24  1.41
D,F   3.20  2.50  2.24  0.00  1.00
E     4.24  3.54  1.41  1.00  0.00

Merge the two closest clusters, A and B (distance 0.71):

Dist  A,B   C     D,F   E
A,B   0.00  ?     ?     ?
C     ?     0.00  2.24  1.41
D,F   ?     2.24  0.00  1.00
E     ?     1.41  1.00  0.00

Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Min Distance (Single Linkage):
D(A,B)→C = min(dCA, dCB) = min(5.66, 4.95) = 4.95
D(A,B)→(D,F) = min(dDA, dDB, dFA, dFB) = min(3.61, 2.92, 3.20, 2.50) = 2.50
D(A,B)→E = min(dAE, dBE) = min(4.24, 3.54) = 3.54

Dist  A,B   C     D,F   E
A,B   0.00  4.95  2.50  3.54
C     4.95  0.00  2.24  1.41
D,F   2.50  2.24  0.00  1.00
E     3.54  1.41  1.00  0.00

Merge two closest clusters/Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Dist  A,B   C     D,F   E
A,B   0.00  4.95  2.50  3.54
C     4.95  0.00  2.24  1.41
D,F   2.50  2.24  0.00  1.00
E     3.54  1.41  1.00  0.00

Merge (D,F) and E (distance 1.00) and take single-link distances to the merged cluster:

Dist     (A,B)  C     (D,F),E
(A,B)    0.00   4.95  2.50
C        4.95   0.00  1.41
(D,F),E  2.50   1.41  0.00

Merge two closest clusters/Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Dist     (A,B)  C     (D,F),E
(A,B)    0.00   4.95  2.50
C        4.95   0.00  1.41
(D,F),E  2.50   1.41  0.00

Merge ((D,F),E) and C (distance 1.41):

Dist         (A,B)  ((D,F),E),C
(A,B)        0.00   2.50
((D,F),E),C  2.50   0.00

Final Result

AGGLOMERATIVE CLUSTERING - EXAMPLE

   X1   X2
A  1    1
B  1.5  1.5
C  5    5
D  3    4
E  4    4
F  3    3.5

Data matrix
[Figure: final clustering of the six points]

Dendrogram Representation

AGGLOMERATIVE CLUSTERING - EXAMPLE
1. In the beginning we have 6 clusters: A, B, C, D, E and F.
2. We merge clusters D and F into (D, F) at distance 0.50.
3. We merge clusters A and B into (A, B) at distance 0.71.
4. We merge cluster E and (D, F) into ((D, F), E) at distance 1.00.
5. We merge cluster ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
6. We merge cluster (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
7. The last cluster contains all the objects, which concludes the computation.

[Figure: dendrogram of the merge sequence, with merge heights 0.50, 0.71, 1.00, 1.41 and 2.50]
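SciPy's single-link clustering reproduces exactly this merge sequence (`linkage`/`dendrogram` are real SciPy functions; drawing the plot needs matplotlib):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])  # A..F
Z = linkage(pdist(X), method='single')
print(np.round(Z, 2))                 # merge distances: 0.50, 0.71, 1.00, 1.41, 2.50
dendrogram(Z, labels=list('ABCDEF'))  # draws the dendrogram described above
```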

