
CS423
DATA WAREHOUSING AND DATA MINING

Lecture 10
UNSUPERVISED LEARNING

Dr. Hammad Afzal

hammad.afzal@mcs.edu.pk

Department of Computer Software Engineering


National University of Sciences and Technology (NUST)
CLUSTERING

Clustering is the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.

Clustering is unsupervised classification.
CLUSTERING
 There is no explicit teacher; the system forms clusters, or “natural groupings”, from the structure in the input patterns.
CLUSTERING
 Data WITHOUT classes or labels:

$x_1, x_2, x_3, \ldots, x_n, \quad x_i \in \mathbb{R}^d$

 Deals with finding a structure in a collection of unlabeled data.

 The process of organizing objects into groups whose members are similar in some way.

 A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
CLUSTERING

 In this case we easily identify the 4 clusters into which the data can be divided.

 The similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance measure.
TYPES OF CLUSTERING
 Hierarchical algorithms
 These find successive clusters using previously established clusters.

1. Agglomerative ("bottom-up"): Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.

2. Divisive ("top-down"): Divisive algorithms begin with the whole set and proceed to divide it successively into smaller clusters.
TYPES OF CLUSTERING
 Partitional clustering
 Constructs a partition of the data set to produce several clusters at once.
 The process is repeated iteratively until a termination condition is met.
 Examples:
 K-means clustering
 Fuzzy c-means clustering
K MEANS CLUSTERING
1. Choose the number (K) of clusters and randomly select the centroids of each cluster.
2. For each data point:
   I. Calculate the distance from the data point to each centroid.
   II. Assign the data point to the closest cluster.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until there is no further change in the assignment of data points (or in the centroids), as sketched below.
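A minimal sketch of these steps in Python with NumPy (the function name and structure are my own, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n, d) data array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: distance from every point to every centroid,
        # then assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```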

K MEANS – EXAMPLE 2
 Suppose we have 4 medicines and each has two attributes (pH and weight index). Our goal is to group these objects into K=2 clusters of medicine.

Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4

[Figure: the four medicines plotted by weight and pH-index, with A and B marked as the initial centroids]
K MEANS – EXAMPLE 2
 Compute the distance between all samples and the K centroids:

$c_1 = A = (1, 1), \quad c_2 = B = (2, 1)$

$d(D, c_1) = \sqrt{(5-1)^2 + (4-1)^2} = 5$
$d(D, c_2) = \sqrt{(5-2)^2 + (4-1)^2} \approx 4.24$
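The same two distances, checked with NumPy (a quick verification, not part of the slides):

```python
import numpy as np

D, c1, c2 = np.array([5, 4]), np.array([1, 1]), np.array([2, 1])
print(np.linalg.norm(D - c1))  # 5.0
print(np.linalg.norm(D - c2))  # 4.2426... ~ 4.24
```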

K MEANS – EXAMPLE 2
 Assign each sample to its closest cluster.
 An element in a row of the Group matrix below is 1 if and only if the object is assigned to that group:

         A  B  C  D
Group 1  1  0  0  0
Group 2  0  1  1  1
K MEANS – EXAMPLE 2
 Re-calculate the K centroids

Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

$c_1 = (1, 1)$
$c_2 = \left(\frac{2+4+5}{3},\ \frac{1+3+4}{3}\right) = (11/3,\ 8/3) \approx (3.67,\ 2.67)$

K MEANS – EXAMPLE 2
 Repeat the above steps

Compute the distance of all objects to the new centroids $c_1 = (1, 1)$ and $c_2 = (3.67, 2.67)$.
K MEANS – EXAMPLE 2

Assign the membership to objects: A and B now fall in cluster 1, C and D in cluster 2.
K MEANS – EXAMPLE 2

Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

$c_1 = \left(\frac{1+2}{2},\ \frac{1+1}{2}\right) = (1.5,\ 1)$
$c_2 = \left(\frac{4+5}{2},\ \frac{3+4}{2}\right) = (4.5,\ 3.5)$
K MEANS – EXAMPLE 2

 We obtain G2 = G1: comparing the grouping of the last iteration and this iteration reveals that the objects no longer move between groups.
 Thus, the computation of the k-means clustering has reached stability and no more iterations are needed.
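Running the `kmeans` sketch from earlier on the medicine data reproduces this result:

```python
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])  # A, B, C, D
labels, centroids = kmeans(X, k=2)
print(labels)     # [0 0 1 1] -> {A, B} and {C, D} (cluster ids may be swapped)
print(centroids)  # [[1.5 1. ]
                  #  [4.5 3.5]]
```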

Hierarchical Clustering
HIERARCHICAL CLUSTERING
 Agglomerative and divisive clustering on the data set {a, b, c, d, e}

[Figure: agglomerative clustering proceeds bottom-up (steps 0–4), merging a and b into ab, d and e into de, then c and de into cde, and finally ab and cde into abcde; divisive clustering runs the same steps top-down (steps 4–0).]
AGGLOMERATIVE CLUSTERING
1. Convert object attributes to a distance matrix.
2. Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning).
3. Repeat until the number of clusters is one (or a known number of clusters), as sketched below:
   a. Merge the two closest clusters.
   b. Update the distance matrix.
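A compact sketch of this loop under single-link merging (function and variable names are mine, assuming a precomputed NumPy distance matrix):

```python
import numpy as np
from itertools import combinations

def agglomerative(dist, n_clusters=1):
    """dist: symmetric (N, N) NumPy distance matrix; returns the final clusters."""
    clusters = [[i] for i in range(len(dist))]   # step 2: one cluster per object
    while len(clusters) > n_clusters:            # step 3
        # Step 3a: find the two closest clusters (single link).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(dist[x, y] for x in clusters[p[0]] for y in clusters[p[1]]),
        )
        # Step 3b: merge them; distances are re-derived from members on the next pass.
        clusters[i] += clusters.pop(j)
    return clusters
```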

[Figure: documents d1–d5 merged step by step into d1,d2; d4,d5; then d3,d4,d5]
STARTING SITUATION
 Start with clusters of individual points and a distance/proximity matrix

[Distance matrix with one row/column per point: p1, p2, p3, p4, p5, ...]
[Figure: individual data points p1–p12]
INTERMEDIATE SITUATION
 After some merging steps, we have some clusters

[Distance matrix with one row/column per cluster: C1–C5]
[Figure: clusters C1–C5 over points p1–p12]
INTERMEDIATE SITUATION
 How do we compare two clusters?

[Figure: clusters C1–C5]
INTER-CLUSTER DISTANCE MEASURES

Similarity?
 Single Link
 Average Link
 Complete Link
 Distance between centroids
INTERMEDIATE SITUATION
 We want to merge the two closest clusters (C2 and C5) and update the distance matrix.

[Distance matrix with one row/column per cluster: C1–C5]
[Figure: clusters C1–C5 over points p1–p12]
SINGLE LINK
 Smallest distance between an element in one cluster and an element in the other

$D(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} D(x, y)$
COMPLETE LINK
 Largest distance between an element in one cluster and an element in the other

$D(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} D(x, y)$
AVERAGE LINK
 Average distance between an element in one cluster and an element in the other

$D(c_i, c_j) = \operatorname{avg}_{x \in c_i,\, y \in c_j} D(x, y)$
DISTANCE BETWEEN CENTROIDS
 Distance between the centroids of two clusters

$D(c_i, c_j) = D(\mu_i, \mu_j)$, where $\mu_i$ and $\mu_j$ are the centroids of $c_i$ and $c_j$
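The four measures as small Python functions over arrays of points (a sketch with my own names, not library code):

```python
import numpy as np

def single_link(ci, cj):    # smallest pairwise distance
    return min(np.linalg.norm(x - y) for x in ci for y in cj)

def complete_link(ci, cj):  # largest pairwise distance
    return max(np.linalg.norm(x - y) for x in ci for y in cj)

def average_link(ci, cj):   # average over all pairs
    return np.mean([np.linalg.norm(x - y) for x in ci for y in cj])

def centroid_link(ci, cj):  # distance between the two centroids
    return np.linalg.norm(np.mean(ci, axis=0) - np.mean(cj, axis=0))
```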
AFTER MERGING
 Update the distance matrix: only the entries involving the merged cluster C2∪C5 need to be recomputed ("?")

Dist   C1  C2∪C5  C3  C4
C1         ?
C2∪C5  ?   ?      ?   ?
C3         ?
C4         ?

[Figure: clusters C1, C2∪C5, C3, C4 over points p1–p12]
AGGLOMERATIVE CLUSTERING - EXAMPLE

   X1   X2
A  1    1
B  1.5  1.5
C  5    5
D  3    4
E  4    4
F  3    3.5

Data matrix

Euclidean distance: $d_{AB} = \sqrt{(1-1.5)^2 + (1-1.5)^2} = 0.707$

Dist  A     B     C     D     E     F
A     0.00  0.71  5.66  3.61  4.24  3.20
B     0.71  0.00  4.95  2.92  3.54  2.50
C     5.66  4.95  0.00  2.24  1.41  2.50
D     3.61  2.92  2.24  0.00  1.00  0.50
E     4.24  3.54  1.41  1.00  0.00  1.12
F     3.20  2.50  2.50  0.50  1.12  0.00
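The whole matrix can be reproduced with SciPy (`pdist`/`squareform` are real SciPy functions; the rounding is mine):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])  # A..F
print(np.round(squareform(pdist(X)), 2))  # matches the table above
```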
Merge two closest clusters

AGGLOMERATIVE CLUSTERING - EXAMPLE

   X1   X2
A  1    1
B  1.5  1.5
C  5    5
D  3    4
E  4    4
F  3    3.5

Data matrix

Dist  A     B     C     D     E     F
A     0.00  0.71  5.66  3.61  4.24  3.20
B     0.71  0.00  4.95  2.92  3.54  2.50
C     5.66  4.95  0.00  2.24  1.41  2.50
D     3.61  2.92  2.24  0.00  1.00  0.50
E     4.24  3.54  1.41  1.00  0.00  1.12
F     3.20  2.50  2.50  0.50  1.12  0.00

Find the two closest clusters (D and F at distance 0.50) and merge them into a single cluster.

Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Dist  A     B     C     D     E     F
A     0.00  0.71  5.66  3.61  4.24  3.20
B     0.71  0.00  4.95  2.92  3.54  2.50
C     5.66  4.95  0.00  2.24  1.41  2.50
D     3.61  2.92  2.24  0.00  1.00  0.50
E     4.24  3.54  1.41  1.00  0.00  1.12
F     3.20  2.50  2.50  0.50  1.12  0.00

Dist  A     B     C     D,F   E
A     0.00  0.71  5.66  ?     4.24
B     0.71  0.00  4.95  ?     3.54
C     5.66  4.95  0.00  ?     1.41
D,F   ?     ?     ?     0.00  ?
E     4.24  3.54  1.41  ?     0.00

Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Min Distance (Single Linkage):
D(D,F)→A = min(dDA, dFA) = min(3.61, 3.20) = 3.20
D(D,F)→B = min(dDB, dFB) = min(2.92, 2.50) = 2.50
D(D,F)→C = min(dDC, dFC) = min(2.24, 2.50) = 2.24
D(D,F)→E = min(dDE, dFE) = min(1.00, 1.12) = 1.00

Dist  A     B     C     D,F   E
A     0.00  0.71  5.66  3.20  4.24
B     0.71  0.00  4.95  2.50  3.54
C     5.66  4.95  0.00  2.24  1.41
D,F   3.20  2.50  2.24  0.00  1.00
E     4.24  3.54  1.41  1.00  0.00

Merge two closest clusters

AGGLOMERATIVE CLUSTERING - EXAMPLE

Dist  A     B     C     D,F   E
A     0.00  0.71  5.66  3.20  4.24
B     0.71  0.00  4.95  2.50  3.54
C     5.66  4.95  0.00  2.24  1.41
D,F   3.20  2.50  2.24  0.00  1.00
E     4.24  3.54  1.41  1.00  0.00

Merge the two closest clusters, A and B (distance 0.71):

Dist  A,B   C     D,F   E
A,B   0.00  ?     ?     ?
C     ?     0.00  2.24  1.41
D,F   ?     2.24  0.00  1.00
E     ?     1.41  1.00  0.00

Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Min Distance (Single Linkage):
D(A,B)→C = min(dCA, dCB) = min(5.66, 4.95) = 4.95
D(A,B)→(D,F) = min(dDA, dDB, dFA, dFB) = min(3.61, 2.92, 3.20, 2.50) = 2.50
D(A,B)→E = min(dAE, dBE) = min(4.24, 3.54) = 3.54

Dist  A,B   C     D,F   E
A,B   0.00  4.95  2.50  3.54
C     4.95  0.00  2.24  1.41
D,F   2.50  2.24  0.00  1.00
E     3.54  1.41  1.00  0.00

Merge two closest clusters/Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Dist  A,B   C     D,F   E
A,B   0.00  4.95  2.50  3.54
C     4.95  0.00  2.24  1.41
D,F   2.50  2.24  0.00  1.00
E     3.54  1.41  1.00  0.00

Merge (D,F) and E (distance 1.00) and take single-link distances to the merged cluster:

Dist     (A,B)  C     (D,F),E
(A,B)    0.00   4.95  2.50
C        4.95   0.00  1.41
(D,F),E  2.50   1.41  0.00

Merge two closest clusters/Update Distance Matrix

AGGLOMERATIVE CLUSTERING - EXAMPLE

Dist     (A,B)  C     (D,F),E
(A,B)    0.00   4.95  2.50
C        4.95   0.00  1.41
(D,F),E  2.50   1.41  0.00

Merge ((D,F),E) and C (distance 1.41):

Dist         (A,B)  ((D,F),E),C
(A,B)        0.00   2.50
((D,F),E),C  2.50   0.00

Final Result

AGGLOMERATIVE CLUSTERING - EXAMPLE

   X1   X2
A  1    1
B  1.5  1.5
C  5    5
D  3    4
E  4    4
F  3    3.5

Data matrix
[Figure: final clustering of the six points]

Dendrogram Representation

AGGLOMERATIVE CLUSTERING - EXAMPLE
1. In the beginning we have 6 clusters: A, B, C, D, E and F.
2. We merge clusters D and F into (D, F) at distance 0.50.
3. We merge clusters A and B into (A, B) at distance 0.71.
4. We merge cluster E and (D, F) into ((D, F), E) at distance 1.00.
5. We merge cluster ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
6. We merge cluster (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
7. The last cluster contains all the objects, which concludes the computation.

[Figure: dendrogram of the merge sequence, with merge heights 0.50, 0.71, 1.00, 1.41 and 2.50]
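SciPy's single-link clustering reproduces exactly this merge sequence (`linkage`/`dendrogram` are real SciPy functions; drawing the plot needs matplotlib):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])  # A..F
Z = linkage(pdist(X), method='single')
print(np.round(Z, 2))                 # merge distances: 0.50, 0.71, 1.00, 1.41, 2.50
dendrogram(Z, labels=list('ABCDEF'))  # draws the dendrogram described above
```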

