CLUSTERING TECHNIQUES
CLUSTER
A cluster is a group of objects that belong to the same
class.
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
HIERARCHICAL METHODS
A hierarchical method creates a hierarchical
decomposition of the given set of data objects.
The result is a tree (dendrogram) representing the way in which
all the points were combined.
This can help in drawing conclusions about the data, including
how many clusters there should be.
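As a concrete illustration, the bottom-up (agglomerative) construction of such a tree can be sketched as follows; the single-linkage distance and all names here are illustrative assumptions, not taken from the slides:

```python
# Minimal agglomerative (bottom-up) hierarchical clustering sketch for 1-D
# points. Single linkage is an illustrative choice, not from the slides.

def single_link_distance(c1, c2):
    """Distance between clusters = distance of their closest pair of points."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, target_k):
    """Repeatedly merge the two closest clusters until target_k remain."""
    clusters = [[p] for p in points]          # start: every point is its own cluster
    merges = []                               # record of merges (the "tree")
    while len(clusters) > target_k:
        # find the closest pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

clusters, merges = agglomerate([1, 2, 10, 11, 12], target_k=2)
```

Stopping the merge loop at different cluster counts corresponds to cutting the tree at different heights.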
HIERARCHICAL CLUSTERING – CONTROLLING
CLUSTERING
In a non-Euclidean space we cannot average points, so a cluster has no centroid.
Solution:
We pick one of the points in the cluster itself to represent the
cluster. This point should be selected to be as close as possible to all
the points in the cluster, so it represents some kind of "center".
We call this representative point the clustroid.
HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Selecting the clustroid.
There are a few ways of selecting the clustroid point.
Select as clustroid the point that minimizes:
1. The sum of the distances to the other points in the cluster.
2. The maximum distance to the other points in the cluster.
3. The sum of the squares of the distances to the other points in the
cluster.
Example:
Using edit distance.
Cluster points: abcd, aecdb, abecb, ecdab.
Their distances:
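The distance table itself did not survive extraction, but it can be recomputed. A minimal sketch, assuming the insert/delete-only variant of edit distance (so ed(x, y) = len(x) + len(y) - 2 * LCS(x, y)); the slide does not state which variant it uses:

```python
# Clustroid selection by edit distance. Assumes the insert/delete-only
# variant: ed(x, y) = len(x) + len(y) - 2 * LCS(x, y).

def lcs(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x):
        for j, cy in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cx == cy else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    return len(x) + len(y) - 2 * lcs(x, y)

points = ["abcd", "aecdb", "abecb", "ecdab"]
# Clustroid: the point minimizing the sum of distances to the other points.
clustroid = min(points, key=lambda p: sum(edit_distance(p, q) for q in points))
```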
Initialization:
As in k-means:
Take a small random sample and cluster it optimally.
The variance of each dimension i in a cluster may be calculated
from the per-cluster summaries N, SUM_i, and SUMSQ_i (as defined above) as
SUMSQ_i / N - (SUM_i / N)^2.
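A minimal sketch of those per-cluster summaries; the class name and layout are assumptions, but the statistics (N, SUM, SUMSQ) and the variance formula follow the text above:

```python
# Sketch of BFR's compact per-cluster summary: N (point count), SUM_i and
# SUMSQ_i per dimension. The class name is an illustrative assumption.

class ClusterSummary:
    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # variance_i = SUMSQ_i / N - (SUM_i / N)^2
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

summary = ClusterSummary(dims=2)
for p in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    summary.add(p)
```

Because the summary is additive, two clusters can be merged by adding their N, SUM, and SUMSQ componentwise, which is why BFR never needs to keep the raw points.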
Continuation:
If this is the last round, merge all mini-clusters in the
Compressed Set, and all Retained Set points, into their nearest
cluster.
Mahalanobis Distance:
Normalized Euclidean distance from the centroid.
For a point x = (x_1, ..., x_d) and a centroid c = (c_1, ..., c_d), with
sigma_i the standard deviation of the cluster's points in dimension i,
the Mahalanobis distance between them is:
d(x, c) = sqrt( ((x_1 - c_1) / sigma_1)^2 + ... + ((x_d - c_d) / sigma_d)^2 )
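A direct transcription of that formula; the function name is an assumption:

```python
import math

# Mahalanobis distance of a point from a cluster centroid, using per-dimension
# standard deviations (the diagonal-covariance form used by BFR).
def mahalanobis(point, centroid, std):
    return math.sqrt(sum(((x - c) / s) ** 2 for x, c, s in zip(point, centroid, std)))

# Example: a dimension with a large spread (std = 3) contributes less to the
# distance than one with a small spread (std = 1).
d = mahalanobis((3.0, 4.0), (1.0, 1.0), (1.0, 3.0))
```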
BFR ALGORITHM
[Figure: Mahalanobis distance illustration.]
Assigning a new point using the Mahalanobis Distance:
If the clusters are normally distributed in d dimensions, then after
the transformation one standard deviation equals sqrt(d).
That means that approximately 68% of the points of the cluster
will have M.D. < sqrt(d).
Assigning rule: assign a new point to a cluster if its M.D. is below a
threshold.
The threshold may be 4 standard deviations.
In a normal distribution, a distance of 3 standard deviations includes around 99.7% of the
points. Thus, with that threshold there is only a very small chance of
rejecting a point that truly belongs to that cluster.
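The assignment rule can be sketched as follows; reading "4 standard deviations" as M.D. < 4 * sqrt(d) is an assumption, as are all names and the sample clusters:

```python
import math

def mahalanobis(point, centroid, std):
    return math.sqrt(sum(((x - c) / s) ** 2 for x, c, s in zip(point, centroid, std)))

# Assignment rule sketch: accept the point into its nearest cluster if its
# Mahalanobis distance is below t standard deviations. With one standard
# deviation equal to sqrt(d), the threshold is t * sqrt(d); t = 4 follows the
# slide's "4 std." reading, which is an assumption.
def assign(point, clusters, t=4.0):
    d = len(point)
    best = min(clusters, key=lambda c: mahalanobis(point, c["centroid"], c["std"]))
    if mahalanobis(point, best["centroid"], best["std"]) < t * math.sqrt(d):
        return best["name"]   # absorb into this cluster's summary (Discard Set)
    return None               # keep the point in the Retained Set for now

clusters = [
    {"name": "c1", "centroid": (0.0, 0.0), "std": (1.0, 1.0)},
    {"name": "c2", "centroid": (10.0, 10.0), "std": (1.0, 1.0)},
]
```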
WHY SUBSPACE CLUSTERING?
(ADAPTED FROM PARSONS ET AL. SIGKDD EXPLORATIONS
2004)
CLIQUE (CLUSTERING IN QUEST)
Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
Automatically identifies subspaces of a high-dimensional data space that allow
better clustering than the original space.
CLIQUE can be considered as both density-based and grid-based:
It partitions each dimension into the same number of equal-length intervals.
It partitions an m-dimensional data space into non-overlapping rectangular units.
A unit is dense if the fraction of the total data points contained in the unit exceeds an
input model parameter.
A cluster is a maximal set of connected dense units within a subspace
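The grid partitioning and density test above can be sketched as follows; the parameter names xi (number of intervals per dimension) and tau (density threshold) follow the CLIQUE paper's convention, while the data handling is an illustrative assumption:

```python
# Sketch of CLIQUE's grid step: partition each dimension into xi equal-length
# intervals and keep the units whose fraction of points exceeds tau.

from collections import Counter

def unit_of(point, mins, maxs, xi):
    """Map a point to its grid unit (a tuple of interval indices)."""
    idx = []
    for x, lo, hi in zip(point, mins, maxs):
        i = int((x - lo) / (hi - lo) * xi)
        idx.append(min(i, xi - 1))   # put boundary points into the last interval
    return tuple(idx)

def dense_units(points, xi, tau):
    mins = [min(c) for c in zip(*points)]
    maxs = [max(c) for c in zip(*points)]
    counts = Counter(unit_of(p, mins, maxs, xi) for p in points)
    return {u for u, n in counts.items() if n / len(points) > tau}

pts = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.1), (9.0, 9.0)]
dense = dense_units(pts, xi=2, tau=0.5)
```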
CLIQUE: THE MAJOR STEPS
Partition the data space and find the number of points that lie
inside each cell of the partition.
Identify the subspaces that contain clusters using the Apriori
principle.
Identify clusters:
Determine dense units in all subspaces of interest.
Determine connected dense units in all subspaces of interest.
[Figure: CLIQUE example with density threshold tau = 3: grids over the (age, salary) and (age, vacation (weeks)) planes for ages 20-60, showing the dense units found in each 2-D subspace.]
DENSE UNIT-BASED METHOD FOR
SUBSPACE CLUSTERING
COMP5331
[Figure: density of points over the (Age, Income) plane.]
Cluster: a maximal set of connected dense units in k-dimensions
[Figure: connected dense units forming a cluster in the (Age, Income) plane.]
Step 1: Identify sub-spaces that contain dense units.
Step 2: Identify clusters in each sub-space that contains
dense units.
STEP 1
Suppose we want to find all dense units (e.g.,
dense units with density >= 20%).
Property:
If a set S of points is a cluster in a k-dimensional space, then
S is also part of a cluster in any (k-1)-dimensional projection
of the space.
[Figure: the (Age, Income) grid with T = 20%; two units are dense in one 1-D projection and three units are dense in the other.]
We can make use of the Apriori approach to solve the
problem.
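A sketch of that Apriori-style step, assuming dense units are represented as (dimension, interval) pairs; the L1/C2 names echo the slides, and the sample data is invented for illustration:

```python
from itertools import combinations

# Apriori-style candidate generation sketch: 2-D candidate units (C2) are
# built only from pairs of 1-D dense units (L1) on different dimensions,
# since any dense 2-D unit must project onto dense 1-D units (the
# monotonicity property above).

def candidates_2d(l1):
    """l1: set of (dimension, interval_index) 1-D dense units."""
    return {
        (a, b)
        for a, b in combinations(sorted(l1), 2)
        if a[0] != b[0]   # the two units must come from different dimensions
    }

l1 = {("age", 3), ("age", 4), ("income", 0)}
c2 = candidates_2d(l1)
```

Each candidate in C2 would then be counted against the data, and only the candidates that are actually dense survive; the same generate-and-count cycle extends to higher dimensions.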
[Figure: the (Age, Income) grid, with Age partitioned into units A1-A6 and Income into units I1-I4.]
[Figure: 2-dimensional dense unit generation: the candidate set C2 is generated from the 1-dimensional dense units L1 over A1-A6.]
Step 1: Identify sub-spaces that contain dense units.
Step 2: Identify clusters in each sub-space that contains
dense units.
STEP 2
Suppose we want to find all dense units (e.g.,
dense units with density >= 20%).
Cluster 2 consists of the dense unit (A4, I1),
e.g., A4 = Age 21-25 and I1 = Income 10k-15k.
Cluster 2: Age = 21-25 and Income = 10k-15k.
[Figure: the (Age, Income) grid with the unit (A4, I1) highlighted.]
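Step 2, grouping connected dense units into clusters, can be sketched as a connected-components search over the grid, where two units are connected if they differ by 1 in exactly one coordinate; the function names here are illustrative assumptions:

```python
# Step 2 sketch: a cluster is a maximal set of connected dense units; two
# units are neighbours if they differ by exactly 1 in one coordinate.

def neighbours(u, units):
    for i in range(len(u)):
        for delta in (-1, 1):
            v = u[:i] + (u[i] + delta,) + u[i + 1:]
            if v in units:
                yield v

def clusters_from_dense_units(units):
    """Group dense units (tuples of interval indices) into connected components."""
    units = set(units)
    clusters = []
    while units:
        frontier = [units.pop()]
        component = set(frontier)
        while frontier:
            u = frontier.pop()
            for v in list(neighbours(u, units)):
                units.remove(v)
                component.add(v)
                frontier.append(v)
        clusters.append(component)
    return clusters

found = clusters_from_dense_units({(0, 0), (0, 1), (1, 1), (3, 3)})
```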