DWDM Unit3
Lecture Topic
*************************************************
Lecture-41 What is Cluster Analysis?
Lecture-42 Types of Data in Cluster Analysis
Lecture-43 A Categorization of Major Clustering Methods
Lecture-44 Partitioning Methods
Lecture-45 Hierarchical Methods
Lecture-46 Density-Based Methods
Lecture-47 Grid-Based Methods
Lecture-48 Model-Based Clustering Methods
Lecture-49 Outlier Analysis
Lecture-41
What is Cluster Analysis?
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
Lecture-41 - What is Cluster Analysis?
Examples of Clustering Applications
Dissimilarity matrix (one mode):

    [  0                               ]
    [  d(2,1)   0                      ]
    [  d(3,1)   d(3,2)   0             ]
    [  :        :        :             ]
    [  d(n,1)   d(n,2)   ...   ...   0 ]
Interval-scaled variables
Binary variables
Categorical, Ordinal, and Ratio Scaled
variables
Variables of mixed types
Standardize data
Calculate the mean absolute deviation:
s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|)
where m_f = (1/n) (x_1f + x_2f + … + x_nf)
Calculate the standardized measurement (z-score):
z_if = (x_if − m_f) / s_f
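As a minimal sketch (function and variable names are illustrative), standardizing one interval-scaled variable with the mean absolute deviation can be written as:

```python
# Standardize an interval-scaled variable using the mean absolute
# deviation s_f, then return z-scores z_if = (x_if - m_f) / s_f.

def standardize(values):
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

heights = [150, 160, 170, 180, 190]
print(standardize(heights))  # values centred on 0, middle value exactly 0.0
```

The mean absolute deviation is preferred over the standard deviation here because it is less sensitive to outliers.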
If q = 1, d is Manhattan distance:
d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
If q = 2, d is Euclidean distance:
d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.
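A minimal sketch of the distance family above, where q = 1 gives Manhattan distance and q = 2 gives Euclidean distance (function names are illustrative):

```python
# Minkowski distance of order q between two objects described by
# p interval-scaled variables, matching the formulas above.

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0)
j = (4.0, 6.0)
print(minkowski(i, j, 1))  # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski(i, j, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```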
Lecture-42 - Types of Data in Cluster Analysis
Binary Variables
A contingency table for binary data:

                        Object j
                     1       0      sum
              1      a       b      a+b
  Object i    0      c       d      c+d
            sum     a+c     b+d      p
Nominal variables
Method 1: simple matching
m: number of matches, p: total number of variables
d(i, j) = (p − m) / p
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
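A minimal sketch of both dissimilarities: the standard simple-matching coefficient d(i, j) = (b + c)/(a + b + c + d) for symmetric binary variables, with a, b, c, d as in the contingency table above, and simple matching d(i, j) = (p − m)/p for nominal variables. The example vectors are illustrative, not from the lecture.

```python
# Dissimilarity for symmetric binary variables from contingency counts.
def binary_dissimilarity(i, j):
    a = sum(x == 1 and y == 1 for x, y in zip(i, j))  # both 1
    b = sum(x == 1 and y == 0 for x, y in zip(i, j))  # 1 in i, 0 in j
    c = sum(x == 0 and y == 1 for x, y in zip(i, j))  # 0 in i, 1 in j
    d = sum(x == 0 and y == 0 for x, y in zip(i, j))  # both 0
    return (b + c) / (a + b + c + d)

# Simple matching for nominal variables: m matches out of p variables.
def nominal_dissimilarity(i, j):
    p = len(i)
    m = sum(x == y for x, y in zip(i, j))
    return (p - m) / p

print(binary_dissimilarity((1, 0, 1, 0), (1, 1, 0, 0)))           # (1+1)/4 = 0.5
print(nominal_dissimilarity(("red", "small"), ("red", "large")))  # 1/2 = 0.5
```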
[Figures: 2-D plots (axes 0–10) illustrating object reassignment cases involving objects i, j, h, and t]
CHAMELEON: hierarchical clustering using dynamic modeling
[Figure: example data points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on a 0–10 grid]
CF Tree: a non-leaf node holds entries CF1, CF2, CF3, …, CF5 with pointers child1, child2, child3, …, child5
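Assuming the node above belongs to a BIRCH CF tree, each CF entry summarizes a subcluster as CF = (N, LS, SS): the number of points, their linear sum, and their square sum. A minimal sketch, computed on the five 2-D example points shown earlier:

```python
# Clustering Feature (CF) of a subcluster of 2-D points:
# CF = (N, LS, SS) with N the count, LS the per-dimension linear sum,
# and SS the per-dimension square sum.

def clustering_feature(points):
    n = len(points)
    ls = tuple(sum(p[k] for p in points) for k in range(2))       # linear sum
    ss = tuple(sum(p[k] ** 2 for p in points) for k in range(2))  # square sum
    return n, ls, ss

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(points))  # (5, (16, 30), (54, 190))
```

CFs are additive, which is what lets BIRCH merge subclusters by summing their entries instead of rescanning the data.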
Lecture-45 - Hierarchical Methods
CURE: Shrinking Representative Points
[Figure: CURE pipeline: Data Set, Partition, Merge, Final Clusters]
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if p is within the Eps-neighbourhood of q and q is a core point (its Eps-neighbourhood contains at least MinPts points)
Density-reachable: a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
Lecture-46 - Density-Based Methods
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
[Figure: core and border points of a cluster, with Eps = 1 cm and MinPts = 5]
[Figure: reachability plot showing the cluster-order of the objects]
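The definitions above translate into the DBSCAN algorithm almost directly: grow a cluster from each unvisited core point by following density-reachability. A minimal sketch; the distance function, data set, and parameter values are illustrative, not from the lecture:

```python
import math

def region_query(points, p, eps):
    """Indices of all points in the Eps-neighbourhood of point p (incl. p)."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster = 0
    for p in range(len(points)):
        if labels[p] is not UNVISITED:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:        # not a core point
            labels[p] = NOISE
            continue
        labels[p] = cluster                  # start a new cluster at core point p
        seeds = list(neighbours)
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:           # border point: density-reachable
                labels[q] = cluster
            if labels[q] is not UNVISITED:
                continue
            labels[q] = cluster
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts: # q is also core: keep expanding
                seeds.extend(q_neighbours)
        cluster += 1
    return labels

pts = [(1, 1), (1.2, 1.1), (0.9, 1.0), (8, 8), (8.1, 8.2), (7.9, 8.0), (4, 4)]
print(dbscan(pts, eps=0.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point (4, 4) ends up labelled −1 (noise), illustrating why the method is robust to outliers.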
DENCLUE: using density functions
Gaussian density function:
f_D_Gaussian(x) = Σ_{i=1..N} e^( −d(x, x_i)² / (2σ²) )
Gradient of the Gaussian density function:
∇f_D_Gaussian(x, x_i) = Σ_{i=1..N} (x_i − x) · e^( −d(x, x_i)² / (2σ²) )
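A minimal sketch of evaluating the Gaussian density function above at a point; σ and the sample data are illustrative:

```python
import math

def gaussian_density(x, data, sigma):
    """f_D(x) = sum over data points x_i of exp(-d(x, x_i)^2 / (2*sigma^2))."""
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2))
               for xi in data)

data = [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0)]
# Density is higher near the pair of close points than far from all points:
print(gaussian_density((1.2, 1.1), data, sigma=1.0))
print(gaussian_density((9.0, 9.0), data, sigma=1.0))
```

DENCLUE then follows the gradient of this function uphill to find density attractors, which become the cluster centers.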
[Figure: dense units found in the Salary–age and Vacation (week)–age subspaces (age 20–60, τ = 3), intersected to form a candidate dense region in the Salary–Vacation–age space]
Strength
It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
It is insensitive to the order of records in input and
does not presume some canonical data distribution
It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
The accuracy of the clustering result may be degraded
at the expense of simplicity of the method
Lecture-47 - Grid-Based Methods
Lecture-48
Model-Based Clustering Methods
Model-Based Clustering Methods
Attempt to optimize the fit between the data and some mathematical
model
Statistical and AI approach
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds characteristic description for each concept (class)
COBWEB
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification
tree
Each node refers to a concept and contains a probabilistic
description of that concept
A classification tree
Limitations of COBWEB
The assumption that the attributes are independent of
each other is often too strong because correlation may
exist
Not suitable for clustering large database data –
skewed tree and expensive probability distributions
CLASSIT
an extension of COBWEB for incremental clustering of
continuous data
suffers similar problems as COBWEB
AutoClass (Cheeseman and Stutz, 1996)
Uses Bayesian statistical analysis to estimate the
number of clusters
Popular in industry
Lecture-48 - Model-Based Clustering Methods
Other Model-Based Clustering Methods