
Cluster Analysis

What is Cluster Analysis?


 Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
[Figure: two clusters; intra-cluster distances are minimized, inter-cluster distances are maximized]
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined
classes
 Clustering is used:
 As a stand-alone tool to get insight into data distribution
 Visualization of clusters may unveil important information
 As a preprocessing step for other algorithms
 Efficient indexing or compression often relies on clustering
Some Applications of
Clustering
 Pattern Recognition
 Image Processing
 cluster images based on their visual content
 Bio-informatics
 WWW and IR
 document classification
 cluster Weblog data to discover groups of similar access patterns
What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
 The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
Requirements of Clustering in
Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Usability
Outliers
 Outliers are objects that do not belong to any cluster
or form clusters of very small cardinality

[Figure: a dense cluster and a few outlier points]

 In some applications we are interested in discovering outliers, not clusters (outlier analysis)
Data Structures
 Data matrix (two modes): n objects (rows/tuples) by p attributes (columns/dimensions)

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

 Dissimilarity or distance matrix (one mode): n objects by n objects

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

 Assuming symmetric distance: d(i,j) = d(j,i)

Measuring Similarity in
Clustering
 Dissimilarity/Similarity metric:

 The dissimilarity d(i, j) between two objects i and j is expressed in terms of a distance function, which is typically a metric:
 d(i, j) ≥ 0 (non-negativity)
 d(i, i) = 0 (isolation)
 d(i, j) = d(j, i) (symmetry)
 d(i, j) ≤ d(i, h) + d(h, j) (triangular inequality)

 The definitions of distance functions are usually different for interval-scaled, boolean, categorical, ordinal and ratio-scaled variables.

 Weights may be associated with different variables based on applications and data semantics.
Type of data in cluster
analysis
 Interval-scaled variables
 e.g., salary, height

 Binary variables
 e.g., gender (M/F), has_cancer(T/F)

 Nominal (categorical) variables
 e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

 Ordinal variables
 e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

 Ratio-scaled variables
 population growth (1,10,100,1000,...)

 Variables of mixed types
 multiple attributes with various types
Similarity and Dissimilarity Between
Objects
 Distance metrics are normally used to measure the
similarity or dissimilarity between two data objects
 The most popular conform to Minkowski distance:


$$L_p(i,j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}$$

where i = (x_i1, x_i2, …, x_in) and j = (x_j1, x_j2, …, x_jn) are two n-dimensional data objects, and p is a positive integer

 If p = 1, L1 is the Manhattan (or city block) distance:

$$L_1(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|$$
Similarity and Dissimilarity
Between Objects (Cont.)
 If p = 2, L2 is the Euclidean distance:
$$d(i,j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2 }$$

 Properties
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)

 Also one can use weighted distance:

$$d(i,j) = \sqrt{ w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2 }$$
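 The sketch below (not from the slides) is a minimal Python illustration of these distances; the function names (minkowski, manhattan, euclidean, weighted_euclidean) and the two sample objects are chosen for illustration only.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance L_p between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """L1 (city block) distance: sum of absolute coordinate differences."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """L2 distance: square root of the sum of squared differences."""
    return minkowski(x, y, 2)

def weighted_euclidean(x, y, w):
    """Euclidean distance with one weight per attribute."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

# Example: two 2-dimensional objects (age, years_of_service)
i, j = (30, 5), (50, 15)
print(manhattan(i, j))             # 30.0
print(round(euclidean(i, j), 1))   # 22.4
print(weighted_euclidean(i, j, (0.5, 2.0)))
```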
Binary Variables
 A binary variable has two states: 0 absent, 1 present
 A contingency table for binary data (objects i and j):

                    object j
                  1       0       sum
 object i   1     a       b       a+b
            0     c       d       c+d
           sum   a+c     b+d       p

 Example: i = (0011101001), j = (1001100110)

 Simple matching coefficient distance:
$$d(i,j) = \frac{b + c}{a + b + c + d}$$

 Jaccard coefficient distance:
$$d(i,j) = \frac{b + c}{a + b + c}$$
Binary Variables
 Another approach is to define the similarity of two
objects and not their distance.
 In that case we have the following:
 Simple matching coefficient similarity:
$$s(i,j) = \frac{a + d}{a + b + c + d}$$

 Jaccard coefficient similarity:
$$s(i,j) = \frac{a}{a + b + c}$$

 Note that: s(i,j) = 1 − d(i,j)
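 As an illustration (not part of the slides), the following Python sketch counts a, b, c, d for the two example vectors i and j from the contingency-table slide and evaluates both distance coefficients; the helper names are hypothetical.

```python
def contingency(i, j):
    """Count a, b, c, d for two equal-length 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d

def simple_matching_distance(i, j):
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)

def jaccard_distance(i, j):
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c)

# the two example vectors from the contingency-table slide
i = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
j = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
print(contingency(i, j))                                   # (a, b, c, d)
print(simple_matching_distance(i, j), jaccard_distance(i, j))
```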


Dissimilarity between Binary
Variables
 Example (Jaccard coefficient)

 all attributes are asymmetric binary


 1 denotes presence or positive test
 0 denotes absence or negative test
0 1
d( jack , mary )   0 . 33
2  0 1
1 1
d( jack , jim )   0 . 67
1 1 1
12
d( jim , mary )   0 . 75
1 1  2
A simpler definition
 Each object is mapped to a bitmap (binary vector)
 Jack: 101000
 Mary: 101010
 Jim:  110000

 Simple match distance:
$$d(i,j) = \frac{\text{number of non-common bit positions}}{\text{total number of bits}}$$

 Jaccard coefficient:
$$d(i,j) = 1 - \frac{\text{number of 1's in } i \wedge j}{\text{number of 1's in } i \vee j}$$
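 A small Python sketch (an illustration rather than slide material) applies these bitmap definitions to Jack, Mary and Jim and reproduces the distances 0.33, 0.67 and 0.75 from the earlier slide.

```python
def simple_match_distance(x, y):
    """Fraction of bit positions where the two bitmaps disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jaccard_distance(x, y):
    """1 minus (number of 1s in AND) / (number of 1s in OR)."""
    both = sum(a & b for a, b in zip(x, y))
    either = sum(a | b for a, b in zip(x, y))
    return 1 - both / either

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_distance(jack, mary), 2))  # 0.33
print(round(jaccard_distance(jack, jim), 2))   # 0.67
print(round(jaccard_distance(jim, mary), 2))   # 0.75
```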
Variables of Mixed Types
 A database may contain all the six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio-scaled.
 One may use a weighted formula to combine their effects.
Major Clustering Approaches
 Partitioning algorithms: Construct random partitions and then
iteratively refine them by some criterion
 Hierarchical algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
 Density-based: based on connectivity and density functions
Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
 k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
K-means Clustering

 Partitional clustering approach
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest
centroid
 Number of clusters, K, must be specified
K-means Clustering Algorithm
Input: a database D of m records, r1, ..., rm, and a desired number of clusters k
Output: a set of k clusters that minimizes the squared-error criterion
Begin
  randomly choose k records as the centroids for the k clusters;
  repeat
    assign each record, ri, to the cluster such that the distance between ri and the cluster centroid (mean) is the smallest among the k clusters;
    recalculate the centroid (mean) for each cluster based on the records assigned to it;
  until no change;
End;
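 The following Python sketch mirrors this pseudocode under the assumption of Euclidean distance; the function k_means and its parameters are illustrative, not a prescribed implementation.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(records, k, seed=None):
    """Basic k-means: random initial centroids, assign, recompute means, repeat until no change."""
    rng = random.Random(seed)
    centroids = rng.sample(records, k)              # choose k records as initial centroids
    assignment = None
    while True:
        # assign each record to the cluster with the closest centroid
        new_assignment = [min(range(k), key=lambda c: euclidean(r, centroids[c]))
                          for r in records]
        if new_assignment == assignment:            # "until no change"
            return centroids, assignment
        assignment = new_assignment
        # recalculate each centroid as the mean of the records assigned to it
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:                             # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))

# usage on the example data from the next slides
data = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
print(k_means(data, k=2, seed=0))
```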
K-means Clustering Example

Sample 2-dimensional records for clustering example


RID Age Years_of_service
1 30 5
2 50 25
3 50 15 C1
4 25 5
5 30 10
6 55 25 C2

Assume that the number of desired clusters k is 2.


 Let the algorithm choose records with RID 3 for cluster C1 and RID 6
for cluster C2 as the initial cluster centroids
The remaining records will be assigned to one of those clusters
during the first iteration of the repeat loop
K-means Clustering Example
The Euclidean distance between records rj and rk in n-dimensional space is calculated as:

$$d(r_j, r_k) = \sqrt{ \sum_{f=1}^{n} (x_{jf} - x_{kf})^2 }$$

 rj and rk are the two records whose distance we want to calculate
 In the current example, rk stands for the centroid C1 or C2

Record   distance from C1   distance from C2   it joins cluster
  1            22.4               32.0               C1
  2            10.0                5.0               C2
  4            25.5               36.6               C1
  5            20.6               29.2               C1
K-means Clustering Example

 Now, the new means (centroids) for the two clusters are computed. The mean for a cluster, Ci, with n records of m dimensions is the vector:

$$m_i = \frac{1}{n} \left( \sum_{r_j \in C_i} x_{j1}, \; \sum_{r_j \in C_i} x_{j2}, \; \ldots, \; \sum_{r_j \in C_i} x_{jm} \right)$$

 In our example, records (1, 3, 4, 5) belong to C1 and records (2, 6) belong to C2, so:
 C1(new) = (1/4 (30+50+25+30), 1/4 (5+15+5+10)) = (33.75, 8.75)
 C2(new) = (1/2 (50+55), 1/2 (25+25)) = (52.5, 25)
K-means Clustering Example

 A second iteration proceeds to get the distance of each record with new
centroids
 In the following table: calculate the distance of each record from the new
C1 and C2, and assign each to the suitable cluster

Record distance from C1 distance from C2 it joins cluster

1
2
3
4
5
6
calculate new C1 and C2 ?
tip : C1 will be (28.3, 6.7) and C2 will be (51.7, 21.7)
K-means Clustering Example

 Move to the next iteration and repeat the steps from the previous slide

 Stop when you get the same results (no records change clusters)
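 For reference, this self-contained Python sketch (an illustration, not slide material) runs the loop on the six example records starting from the slide's initial centroids (records 3 and 6) and prints the centroids after each iteration, matching the values worked out above.

```python
import math

records = {1: (30, 5), 2: (50, 25), 3: (50, 15), 4: (25, 5), 5: (30, 10), 6: (55, 25)}

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# start from the slide's initial centroids: record 3 for C1, record 6 for C2
centroids = [records[3], records[6]]
assignment = None
while True:
    # assign each record to the closest centroid (0 = C1, 1 = C2)
    new_assignment = {rid: min((0, 1), key=lambda c: euclidean(r, centroids[c]))
                      for rid, r in records.items()}
    if new_assignment == assignment:                # no record changed cluster: stop
        break
    assignment = new_assignment
    # recompute each centroid as the mean of its assigned records
    for c in (0, 1):
        members = [records[rid] for rid, a in assignment.items() if a == c]
        centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    print("centroids:", [tuple(round(v, 2) for v in cen) for cen in centroids])

print("final clusters:", assignment)
```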
K-means Clustering – Details

 Initial centroids are often chosen randomly.


 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the
cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
 Most of the convergence happens in the first few iterations.
 Often the stopping condition is changed to ‘Until relatively few
points change clusters’
 Complexity is O( n * K * I * d )
 n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Evaluating K-means Clusters
 The terminating condition is usually the squared-error criterion. For clusters
C1, ..., Ck with means m1, ..., mk, the error is defined as:

$$E = \sum_{i=1}^{k} \sum_{r_j \in C_i} \| r_j - m_i \|^2$$

 rj is a data point in cluster Ci and mi is the corresponding mean (centroid) of that cluster
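 A short Python sketch (illustrative; the function name squared_error is assumed) that evaluates this criterion for the final clusters of the worked example:

```python
def squared_error(clusters, centroids):
    """Sum over all clusters of the squared Euclidean distance of each point to its cluster mean."""
    error = 0.0
    for members, m in zip(clusters, centroids):
        for r in members:
            error += sum((a - b) ** 2 for a, b in zip(r, m))
    return error

# final clusters of the worked example and their (rounded) means
c1 = [(30, 5), (25, 5), (30, 10)]
c2 = [(50, 25), (50, 15), (55, 25)]
print(round(squared_error([c1, c2], [(28.3, 6.7), (51.7, 21.7)]), 1))
```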
Solutions to Initial Centroids
Problem
 Multiple runs
 Helps, but probability is not on your side
 Sample and use hierarchical clustering to
determine initial centroids
 Select more than k initial centroids and then
select among these initial centroids
 Select most widely separated
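 One possible reading of "select most widely separated" is a greedy farthest-first pass over candidate centroids; the sketch below illustrates that assumption and is not the specific procedure intended by the slides.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def widely_separated_centroids(candidates, k):
    """Greedy farthest-first selection: start with one candidate, then repeatedly
    pick the candidate farthest from its nearest already-chosen centroid."""
    chosen = [candidates[0]]
    while len(chosen) < k:
        next_c = max(candidates,
                     key=lambda p: min(euclidean(p, c) for c in chosen))
        chosen.append(next_c)
    return chosen

points = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
print(widely_separated_centroids(points, 2))
```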
