Data Warehousing
Lecture 9
Clustering
Topics
What is Cluster Analysis?
Types of Attributes in Cluster Analysis
Major Clustering Approaches
Partitioning Algorithms
Hierarchical Algorithms
Clustering
Unsupervised learning:
Finds “natural” groupings of instances given unlabeled data
Clustering Methods
Many different methods and algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up
What is Cluster Analysis?
Cluster: a collection of data objects
High similarity of objects within a cluster
Low similarity of objects across clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is an unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
General Applications of Clustering
Pattern Recognition
In biology, derive plant and animal taxonomies, categorize genes
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (e.g., market research)
discovering distinct groups in customer bases and characterizing customer groups based on purchasing patterns
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
City-planning: Identifying groups of houses according to their
house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across clusters
Criteria for Clustering
A good clustering method will produce high quality clusters
with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on:
both the similarity measure used by the method and its implementation
ability to discover some or all of the hidden patterns
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Ability to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
The distance function
Simplest case: one numeric attribute A
Distance(X,Y) = |A(X) – A(Y)|
Several numeric attributes:
Distance(X,Y) = Euclidean distance between X,Y
Nominal attributes: distance is set to 1 if values are
different, 0 if they are equal
Are all attributes equally important?
Weighting the attributes might be necessary
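To make these rules concrete, here is a minimal Python sketch (the attribute values and weights below are invented for illustration) that combines numeric and nominal attribute distances, with optional per-attribute weights:

```python
import math

def attribute_distance(x, y, kind):
    """Distance contribution of a single attribute."""
    if kind == "numeric":
        return abs(x - y)             # |A(X) - A(Y)| for a numeric attribute
    return 0.0 if x == y else 1.0     # nominal: 0 if equal, 1 if different

def object_distance(obj_x, obj_y, kinds, weights=None):
    """Euclidean-style distance over mixed attributes, optionally weighted."""
    weights = weights or [1.0] * len(kinds)
    return math.sqrt(sum(w * attribute_distance(a, b, k) ** 2
                         for a, b, k, w in zip(obj_x, obj_y, kinds, weights)))

# Hypothetical objects (age, income, color); color weighted half as much
print(object_distance((25, 50000, "red"), (30, 48000, "blue"),
                      kinds=["numeric", "numeric", "nominal"],
                      weights=[1.0, 1.0, 0.5]))
```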
From Data Matrix to Similarity or Dissimilarity Matrices
Data matrix (or object-by-attribute structure)
m objects with n attributes, e.g., relational data
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1n} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{in} \\ \vdots & & \vdots & & \vdots \\ x_{m1} & \cdots & x_{mj} & \cdots & x_{mn} \end{pmatrix}$$
Similarity and dissimilarity matrices (m × m, symmetric): the diagonal entries are $s_{ii} = 1$ and $d_{ii} = 0$, with $s_{ij} = s_{ji}$, $d_{ij} = d_{ji}$, $0 \le s_{ij} \le 1$ and $d_{ij} \ge 0$:
$$S = \begin{pmatrix} 1 & & \\ s_{21} & 1 & \\ \vdots & & \ddots \\ s_{m1} & \cdots & 1 \end{pmatrix} \qquad D = \begin{pmatrix} 0 & & \\ d_{21} & 0 & \\ \vdots & & \ddots \\ d_{m1} & \cdots & 0 \end{pmatrix}$$
Distance Functions (Overview)
To transform a data matrix to similarity or dissimilarity
matrices, we need a definition of distance.
Some definitions of distance functions depend on the type of
attributes
interval-scaled attributes
Boolean attributes
nominal, ordinal and ratio attributes.
Weights should be associated with different attributes based
on applications and data semantics.
It is hard to define “similar enough” or “good enough”
the answer is typically highly subjective.
Similarity and Dissimilarity Between Objects (I)
Distances are normally used to measure the similarity
or dissimilarity between two data objects
Some popular ones include: Minkowski distance:
$$d_{ij} = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and q is a positive integer
If q = 1, d is Manhattan distance
$$d_{ij} = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
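As a sketch of the formula above, a small Python function (the parameter names are mine, not from the slides) computes the Minkowski distance for any positive q:

```python
def minkowski(x, y, q=2):
    """Minkowski distance: q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))  # Manhattan: 3 + 4 + 0 = 7.0
print(minkowski(i, j, q=2))  # Euclidean: sqrt(9 + 16) = 5.0
```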
Similarity and Dissimilarity Between Objects (II)
If q = 2, d is Euclidean distance:
$$d_{ij} = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
Properties
d(i,j) >= 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) <= d(i,k) + d(k,j)
Also one can use weighted distance, parametric Pearson product
moment correlation, or other dissimilarity measures.
$$d_{ij} = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_p |x_{ip} - x_{jp}|^2}$$
Types of Attributes in Clustering
Interval-scaled attributes
Continuous measures of a roughly linear scale
Binary attributes
Two-state measures: 0 or 1
Nominal, ordinal, and ratio attributes
More than two states, nominal or ordinal or nonlinear scale
Mixed types
Mixture of interval-scaled, symmetric binary, asymmetric binary,
nominal, ordinal, or ratio-scaled attributes
Example: gender is a symmetric attribute and the remaining attributes are asymmetric binary. Here, let the values Y and P be set to 1, and the value N be set to 0. Then calculate dissimilarity using only the asymmetric binary attributes:
$$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
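The three dissimilarities above follow the standard contingency-table formula for asymmetric binary attributes, d = (r + s)/(q + r + s). A small Python sketch; the 0/1 vectors below are one encoding consistent with the slide's results, since the original patient table is not reproduced here:

```python
def asym_binary_dissim(x, y):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s):
    q = #(1,1) matches, r = #(1,0), s = #(0,1); (0,0) pairs are ignored."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    return (r + s) / (q + r + s)

# Hypothetical encodings (Y/P -> 1, N -> 0) that reproduce the slide's numbers
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```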
Nominal Attributes
A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m is the # of matches, p is the total # of nominal attributes
$$d_{ij} = \frac{p - m}{p}$$
Method 2: Use a large number of binary variables
creating a new binary variable for each of the M nominal
states
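A minimal sketch of Method 1 in Python (the example attribute values are invented):

```python
def simple_matching_dissim(x, y):
    """Method 1: d(i,j) = (p - m) / p, where m = # matches, p = # attributes."""
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))
    return (p - m) / p

print(simple_matching_dissim(("red", "small", "round"),
                             ("red", "large", "round")))  # (3 - 2) / 3 = 0.33...
```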
K-means example, step 1
Given k = 3, pick 3 initial cluster centers k1, k2, k3 (randomly).
[Figure: data points in the X-Y plane with three randomly placed centers]
K-means example, step 2
Assign each point to the closest cluster center.
[Figure: points grouped around k1, k2, k3]
K-means example, step 3
Move each cluster center to the mean of the points assigned to it.
[Figure: k1, k2, k3 shifted to their cluster means]
K-means example, step 4
Reassign the points that are now closest to a different cluster center.
Q: Which points are reassigned?
[Figure: points near the old cluster boundaries]
K-means example, step 4 …
A: three points are reassigned (highlighted in the slide animation).
[Figure: the three reassigned points]
K-means example, step 4b
Re-compute the cluster means.
[Figure: updated means for k1, k2, k3]
K-means example, step 5
Move the cluster centers to the new cluster means; repeat until the assignments no longer change.
[Figure: final positions of k1, k2, k3]
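The five steps above condense into a short Python sketch of the basic k-means loop (the data and seed are made up; a production version would also run several random restarts, which matters for the local-minimum problem discussed next):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: random initial centers, then alternate assignment
    and mean-update steps until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 1: pick k initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]        # step 2: assign to closest center
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[idx].append(p)
        new_centers = [                          # steps 3-5: move centers to means
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)]
        if new_centers == centers:               # converged: nothing moved
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, k=2)[0])
```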
Problems to be considered
What can be the problems with k-means clustering?
The result can vary significantly depending on the initial choice of seeds (number and position).
It can get trapped in a local minimum.
[Figure: instances with an unlucky choice of initial cluster centers]
PAM example, step 1
Initialise k medoids. Let us assume c1 = (3,4) and c2 = (7,4).
Calculate the distances so as to associate each data object with its nearest medoid. Assume that the cost is calculated using the Minkowski distance metric with r = 1 (Manhattan distance).

Data object (Xi)   Cost to c1 = (3,4)   Cost to c2 = (7,4)
(2,6)                     3                    7
(3,8)                     4                    8
(4,7)                     4                    6
(6,2)                     5                    3
(6,4)                     3                    1
(7,3)                     5                    1
(8,5)                     6                    2
(7,6)                     6                    2
PAM example, step 1b
Then the clusters become:
Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
Each object joins the medoid with the smaller cost in the table above, so
Total cost = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20
PAM example, step 2
Select a non-medoid O′ at random. Let us assume O′ = (7,3), so the medoids are now c1 = (3,4) and O′ = (7,3).
Calculate the cost of the new configuration using the formula from step 1:
Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

Data object (Xi)   Cost to c1 = (3,4)   Cost to O′ = (7,3)
(2,6)                     3                    8
(3,8)                     4                    9
(4,7)                     4                    7
(6,2)                     5                    2
(6,4)                     3                    2
(7,4)                     4                    1
(8,5)                     6                    3
(7,6)                     6                    3
PAM example, step 2b
So the cost of swapping medoid c2 for O′ is
S = new total cost − previous total cost = 22 − 20 = 2 > 0
Since S > 0, moving to O′ would be a bad idea; the previous choice was good, and the algorithm terminates here (i.e., there is no change in the medoids).
Note that some data points may still shift from one cluster to another, depending on their closeness to the medoids.
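The whole cost comparison above fits in a few lines of Python; a sketch using the slide's data points and Manhattan distance:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    """PAM objective: each non-medoid contributes its distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

points = [(3, 4), (7, 4), (2, 6), (3, 8), (4, 7),
          (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)]
print(total_cost(points, [(3, 4), (7, 4)]))  # 20, the current configuration
print(total_cost(points, [(3, 4), (7, 3)]))  # 22, after swapping c2 for O'
# S = 22 - 20 = 2 > 0, so the swap is rejected and the medoids stay unchanged
```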
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw in 1990)
Built into statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on each
sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased
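A sketch of the CLARA idea, assuming samples small enough that an exhaustive stand-in for PAM is affordable (the sample sizes, helper names, and exhaustive search are my simplifications, not the original algorithm):

```python
import random
from itertools import combinations

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam_small(points, k):
    """Stand-in for PAM: try every k-subset of the sample as the medoid set.
    Feasible only because CLARA works on small samples."""
    return min(combinations(points, k),
               key=lambda meds: total_cost(points, meds))

def clara(points, k, num_samples=5, sample_size=10, seed=0):
    """CLARA: run PAM on several random samples, keep the medoids that
    score best on the WHOLE data set."""
    rng = random.Random(seed)
    best_cost, best_medoids = float("inf"), None
    for _ in range(num_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam_small(sample, k)
        cost = total_cost(points, medoids)  # evaluate on all points, not the sample
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids
```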
Distances - Hierarchical Clustering
Centroid comparison (mean distance):
$$d_{mean}(C_i, C_j) = \lVert m_i - m_j \rVert$$
Element comparison (average distance):
$$d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} \lVert p - p' \rVert$$
Distances - Hierarchical Clustering
(Graphical Representation)
Four measures for distance between clusters are (1) single linkage, (2) complete
linkage, (3) centroid comparison and (4) element comparison
[Figure: two example clusters with (1) single and (2) complete linkage distances drawn between them, and (3) centroid comparison; x = centroids]
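A small Python sketch of the four inter-cluster distance measures (the two example clusters are invented):

```python
import math
from itertools import product

def d_single(ci, cj):    # (1) single linkage: closest pair
    return min(math.dist(p, q) for p, q in product(ci, cj))

def d_complete(ci, cj):  # (2) complete linkage: farthest pair
    return max(math.dist(p, q) for p, q in product(ci, cj))

def centroid(c):
    return tuple(sum(xs) / len(c) for xs in zip(*c))

def d_mean(ci, cj):      # (3) centroid comparison: distance between means
    return math.dist(centroid(ci), centroid(cj))

def d_avg(ci, cj):       # (4) element comparison: average pairwise distance
    return (sum(math.dist(p, q) for p, q in product(ci, cj))
            / (len(ci) * len(cj)))

a, b = [(0, 0), (0, 1)], [(3, 0), (4, 0)]
print(d_single(a, b), d_complete(a, b), d_mean(a, b), d_avg(a, b))
```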
Other Clustering Methods
Incremental Clustering
Probability-based Clustering, Bayesian Clustering
EM Algorithm
Cluster Schemes
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Advanced Method: BIRCH (Overview)
Balanced Iterative Reducing and Clustering using Hierarchies [Tian Zhang,
Raghu Ramakrishnan, Miron Livny, 1996]
Incremental, hierarchical, one scan
Save clustering information in a tree
Each entry in the tree contains information about one cluster
New nodes inserted in closest entry in tree
Only works with "metric" attributes
Must have Euclidean coordinates
Designed for very large data sets
Time and memory constraints are explicit
Treats dense regions of data points as sub-clusters
Not all data points are important for clustering
Only one scan of data is necessary
BIRCH (Merits)
Incremental, distance-based approach
Decisions are made without scanning all data points, or all
currently existing clusters
Does not need the whole data set in advance
Unique approach: Distance-based algorithms generally need
all the data points to work
Make best use of available memory while minimizing I/O
costs
Does not assume that the probability distributions of attributes are independent
BIRCH – Clustering Feature and Clustering Feature Tree
BIRCH introduces two concepts, clustering feature and clustering
feature tree (CF Tree), which are used to summarize cluster
representations.
These structures help the clustering method achieve good speed and
scalability in large databases and make it effective for incremental and
dynamic clustering of incoming objects
Given n d-dimensional data objects or points in a cluster, we can define the centroid x0, radius R, and diameter D of the cluster as follows:
$$x_0 = \frac{\sum_{i=1}^{n} x_i}{n} \qquad R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}} \qquad D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}}$$
BIRCH – Centroid, Radius and Diameter
• Given a cluster of n instances {x_1, …, x_n}, we define:
• Centroid: the center of the cluster, $x_0 = \frac{1}{n}\sum_{i=1}^{n} x_i$
BIRCH – Centroid Euclidean and Manhattan distances
• The centroid Euclidean distance and centroid Manhattan distance are defined between any two clusters.
• Centroid Euclidean distance: $D_0 = \lVert x_{0_1} - x_{0_2} \rVert_2$
• Centroid Manhattan distance: $D_1 = \lVert x_{0_1} - x_{0_2} \rVert_1$
BIRCH
(Average inter-cluster, Average intra-cluster, Variance increase)
• The average inter-cluster, average intra-cluster, and variance increase distances between two clusters $C_1 = \{x_i\}_{i=1}^{n_1}$ and $C_2 = \{y_j\}_{j=1}^{n_2}$ are defined as follows
• Average inter-cluster: $D_2 = \left( \frac{\sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \lVert x_i - y_j \rVert^2}{n_1 n_2} \right)^{1/2}$
• Average intra-cluster (over the points $z_k$ of the merged cluster): $D_3 = \left( \frac{\sum_{i=1}^{n_1+n_2} \sum_{j=1}^{n_1+n_2} \lVert z_i - z_j \rVert^2}{(n_1+n_2)(n_1+n_2-1)} \right)^{1/2}$
• Variance increase: the growth in total squared deviation from the centroid caused by merging the two clusters
Clustering Feature
CF = (N,LS,SS)
N: Number of points in cluster
LS: linear sum of the N points in the cluster
SS: sum of squares of the N points in the cluster
CF Tree
Balanced search tree
Node has CF triple for each child
Leaf node represents a cluster and holds a CF entry for each
subcluster in it.
Each subcluster in a leaf has a maximum diameter, bounded by the threshold T
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N data points, $\sum_{i=1}^{N} X_i$
SS: square sum of the N data points, $\sum_{i=1}^{N} X_i^2$
Example: the cluster {(3,4), (2,6), (4,5), (4,7), (3,8)} gives CF1 = (5, (16,30), (54,190))
The cluster {(6,2), (7,2), (7,4), (8,4), (8,5)} gives CF2 = (5, (36,17), (262,65))
Merging the two clusters simply adds their CFs: CF = CF1 + CF2 = (10, (52,47), (316,255))
Non-leaf node: entries [CF1, child1], [CF2, child2], [CF3, child3], …, [CF5, child5]
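The CF triple and its additivity are easy to express directly; a sketch reproducing the slide's numbers (the class name and the radius helper are mine):

```python
import math

class CF:
    """Clustering feature CF = (N, LS, SS) for d-dimensional points."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def from_points(cls, pts):
        dims = range(len(pts[0]))
        return cls(len(pts),
                   tuple(sum(p[i] for p in pts) for i in dims),
                   tuple(sum(p[i] ** 2 for p in pts) for i in dims))

    def __add__(self, other):
        """CF additivity: merging two clusters just adds their CFs."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  tuple(a + b for a, b in zip(self.ss, other.ss)))

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """R from the CF alone: R^2 = sum_d (SS_d / N - centroid_d^2)."""
        return math.sqrt(sum(s / self.n - c ** 2
                             for s, c in zip(self.ss, self.centroid())))

cf1 = CF.from_points([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
cf2 = CF.from_points([(6, 2), (7, 2), (7, 4), (8, 4), (8, 5)])
merged = cf1 + cf2
print(cf1.n, cf1.ls, cf1.ss)           # 5 (16, 30) (54, 190)
print(merged.n, merged.ls, merged.ss)  # 10 (52, 47) (316, 255)
```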
Properties of CF-Tree
Each non-leaf node has at most B entries
Each leaf node has at most L CF entries which each satisfy threshold
T
Node size is determined by dimensionality of data space and input
parameter P (page size)
The branching factor B and the threshold T are the tree's two key parameters.
BIRCH Algorithm (CF-Tree Insertion)
Identify the appropriate leaf: starting from the root, descend the tree by choosing the closest child entry at each level
Modify the leaf: find the closest leaf entry; if it can absorb the new point without exceeding the threshold T, update its CF, otherwise add a new entry (splitting the leaf if it exceeds L entries)
Update the CF entries on the path from the root to the leaf, splitting non-leaf nodes that exceed B entries
BIRCH Algorithm (Overall steps)
Four phases: (1) load the data into a CF-tree, (2) condense, (3) global clustering, (4) refine; each phase is detailed below.
Details of Each Step
Phase 1: Load data into memory
Build an initial in-memory CF-tree with the data (one scan)
Subsequent phases become fast, accurate, less order sensitive
Phase 2: Condense data
Rebuild the CF-tree with a larger T
Condensing is optional
Phase 3: Global clustering
Use existing clustering algorithm on CF entries
Helps fix problem where natural clusters span nodes
Phase 4: Cluster refining
Do additional passes over the dataset & reassign data points to the
closest centroid from phase 3
Refining is optional
Summary of BIRCH
BIRCH works with very large data sets
Explicitly bounded by computational resources.
The computation complexity is O(n), where n is the number of
objects to be clustered.
Runs with specified amount of memory (P)
Superior to CLARANS and k-MEANS
Quality, speed, stability and scalability
CURE (Clustering Using REpresentatives)
CURE was proposed by Guha, Rastogi & Shim, 1998
It stops the creation of a cluster hierarchy if a level consists of k
clusters
Each cluster has c representatives (instead of one)
Choose c well-scattered points in the cluster
Shrink them towards the mean of the cluster by a fraction α (the shrink factor)
The representatives capture the physical shape and geometry of the
cluster
It can handle arbitrarily shaped clusters and avoids the single-link effect.
Merge the closest two clusters
Distance of two clusters: the distance between the two closest
representatives
CURE Algorithm (for Large Databases)
1. Draw a random sample of size s from the database
2. Partition the sample into p partitions of size s/p
3. Partially cluster each partition into s/(pq) clusters
4. Eliminate outliers (clusters that grow too slowly are removed)
5. Cluster the partial clusters, shrinking the c representatives towards the centroid by the fraction α
6. Label the remaining data on disk by the closest cluster
Data Partitioning and Clustering
Example with sample size s = 50 and p = 2 partitions: each partition holds s/p = 25 points and is partially clustered into s/(pq) = 5 clusters (here q = 5).
[Figure: the sample is partitioned, each partition is partially clustered, and the partial clusters are merged]
Cure: Shrinking Representative Points
Shrink the multiple representative points towards the gravity center by a fraction α.
Multiple representatives capture the shape of the cluster.
[Figure: representative points before and after shrinking]
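Shrinking is one line per coordinate; a sketch with an invented cluster (α = 0.3 is a typical shrink factor, not a value from the slides):

```python
def shrink_representatives(reps, center, alpha=0.3):
    """Move each representative point a fraction alpha toward the centroid.
    alpha = 0 keeps the points; alpha = 1 collapses them onto the centroid."""
    return [tuple(r + alpha * (c - r) for r, c in zip(rep, center))
            for rep in reps]

reps = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]   # well-scattered representatives
print(shrink_representatives(reps, center=(5.0, 3.0)))
```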
Clustering Categorical Data: ROCK
ROCK: RObust Clustering using linKs,
by S. Guha, R. Rastogi, K. Shim (ICDE’99).
Use links to measure similarity/proximity
Not distance-based with categorical attributes
Computational complexity: O(n² + n·m_m·m_a + n² log n), where m_a is the average and m_m the maximum number of neighbors
Basic ideas (Jaccard coefficient):
Similarity function and neighbors:
$$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$
Let T1 = {1,2,3} and T2 = {3,4,5}:
$$Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$$
ROCK: An Example
Links: The number of common neighbors for the two points.
Using the Jaccard coefficient to determine neighbors:
Sim(pt1,pt4) = 0, Sim(pt1,pt2) = 0, Sim(pt1,pt3) = 0
Sim(pt2,pt3) = 0.6, Sim(pt2,pt4) = 0.2
Sim(pt3,pt4) = 0.2
Use 0.2 as the similarity threshold for neighbors
Pt2 and Pt3 have 3 common neighbors
Pt3 and Pt4 have 3 common neighbors
Pt2 and Pt4 have 3 common neighbors
Resulting clusters: {1} and {2,3,4}, which makes more sense
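A sketch of the neighbor/link computation; the slide's pt1-pt4 data is not shown, so the transactions below are invented and will not reproduce the exact numbers above:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def neighbor_sets(points, theta):
    """Point i's neighbors = all points with Jaccard similarity >= theta
    (by this definition, a point is its own neighbor)."""
    return {i: {j for j in range(len(points))
                if jaccard(points[i], points[j]) >= theta}
            for i in range(len(points))}

def link(nbrs, i, j):
    """link(i, j) = number of common neighbors of points i and j."""
    return len(nbrs[i] & nbrs[j])

pts = [{1, 2, 6}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5}]  # hypothetical transactions
nbrs = neighbor_sets(pts, theta=0.2)
print(link(nbrs, 1, 2), link(nbrs, 2, 3))
```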
ROCK: Property & Algorithm
Links: The number of common neighbours for the
two points.
Consider the ten 3-element subsets of {1,2,3,4,5}:
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}
{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
link({1,2,3}, {1,2,4}) = 3: the two sets have three common neighbors
Algorithm
Draw random sample
Cluster with links (maybe agglomerative hierarchical)
Label data on disk
CHAMELEON
CHAMELEON: hierarchical clustering using dynamic modeling, by G.
Karypis, E. H. Han and V. Kumar (1999)
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity and
closeness (proximity) between two clusters are high relative to
the internal interconnectivity of the clusters and closeness of
items within the clusters
A two-phase algorithm:
1. Use a graph partitioning algorithm: cluster objects into a large
number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the
genuine clusters by repeatedly combining these sub-clusters
Graph-based clustering
Sparsification techniques keep the connections to the most
similar (nearest) neighbors of a point while breaking the
connections to less similar points.
The nearest neighbors of a point tend to belong to the same
class as the point itself.
This reduces the impact of noise and outliers and sharpens
the distinction between clusters.
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as
density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Clustering (Background)
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
N_Eps(p) = {q ∈ D | dist(p,q) ≤ Eps}
Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
1) p belongs to N_Eps(q)
2) core point condition: |N_Eps(q)| ≥ MinPts
[Figure: p in the Eps-neighborhood of core point q; example parameters MinPts = 5, Eps = 1 cm]
Density-Based Clustering (Background)
The neighborhood within a radius Eps of a given object is called the Eps-neighborhood of the object.
If the Eps-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
Given a set of objects D, we say that an object p is directly density-reachable from object q if p is within the Eps-neighborhood of q, and q is a core object.
An object p is density-reachable from object q with respect to Eps and MinPts in a set of objects D, if there is a chain of objects p1, …, pn, where p1 = q and pn = p, such that p(i+1) is directly density-reachable from pi with respect to Eps and MinPts, for 1 ≤ i < n, pi ∈ D.
An object p is density-connected to object q with respect to Eps and MinPts if there is an object o ∈ D such that both p and q are density-reachable from o with respect to Eps and MinPts.
Density-Based Clustering
[Figure: a chain of points q = p1, …, pn = p illustrating density-reachability]
Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
Density-connected
A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
[Figure: points p and q are both density-reachable from a common point o]
DBSCAN: Density Based Spatial Clustering of Applications with
Noise
Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with
noise
[Figure: core, border, and outlier (noise) points; Eps = 1 cm, MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p wrt. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
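Putting the definitions together, a compact Python sketch of DBSCAN (labels of -1 mark noise; the toy points are invented):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: label each point with a cluster id, or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def region_query(i):
        """N_Eps(points[i]): indices of all points within eps (including i)."""
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:          # not a core point: tentatively noise
            labels[i] = NOISE
            continue
        labels[i] = cluster_id                # start a new cluster from core point i
        seeds = list(neighbors)
        while seeds:                          # expand via density-reachability
            j = seeds.pop()
            if labels[j] == NOISE:            # border point: claim it for the cluster
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:   # j is core too: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 0)]
print(dbscan(pts, eps=1.5, min_pts=3))  # two clusters plus one noise point
```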