
UNIT-4

CLUSTERING TECHNIQUES
CLUSTER
 A cluster is a group of objects that belong to the same class.
 In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
CLUSTERING
What is Clustering?
 Clustering is the process of grouping abstract objects into classes of similar objects.
CLUSTERING
 The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
 Clustering is also called data segmentation in some applications, because clustering partitions large data sets into groups according to their similarity.
 Clustering can also be used for outlier detection.
EXAMPLE: APPLICATION OF A CLUSTERING ALGORITHM
 Suppose we are a market manager, and we have a new, tempting product to sell.
 We are sure that the product would bring enormous profit, as long as it is sold to the right people.
 So, how can we tell who is best suited for the product from our company's huge customer base?
 If the data from the customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.
CLUSTERING
 In machine learning, clustering is an example of unsupervised learning.
o Unsupervised learning does not rely on predefined classes and class-labeled training examples.
o For this reason, clustering is a form of learning by observation, rather than learning by examples.
POINTS TO REMEMBER
 A cluster of data objects can be treated as one group.
 While doing cluster analysis, we first partition the set of
data into groups based on data similarity and then assign
the labels to the groups.
 The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
A CATEGORIZATION OF MAJOR
CLUSTERING METHODS

 Partitioning Method
 Hierarchical Method

 Density-based Method

 Grid-Based Method

 Model-Based Method

 Constraint-based Method
HIERARCHICAL METHODS
 A hierarchical method creates a hierarchical decomposition of the given set of data objects.
 A hierarchical clustering method works by grouping data objects into a tree of clusters.
 We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
ALGORITHMIC APPROACHES
There are two main approaches:
 Hierarchical algorithms:
 Agglomerative (bottom-up): Start with each point as a cluster. Clusters
are combined based on their “closeness”, using some definition of
“close” (will be discussed later).
 Divisive (top-down): Start with one cluster including all points and
recursively split each cluster based on some criterion.
Will not be discussed in this presentation.
 Point assignment algorithms:
 Points are considered in some order, and each one is assigned to the cluster into which it best fits.
ALGORITHMIC APPROACHES

Other distinctions between clustering algorithms:
 Whether the algorithm assumes a Euclidean space, or works for an arbitrary distance measure (non-Euclidean space).
 Whether the algorithm assumes that the data is small enough to fit in main memory, or whether the data must primarily reside in secondary memory.
AGGLOMERATIVE HIERARCHICAL
CLUSTERING
 The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group.
 It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition is satisfied.
DIVISIVE HIERARCHICAL
CLUSTERING
 The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster.
 In each successive iteration, it subdivides a cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied (for example, each cluster is within a certain distance threshold).
HIERARCHICAL
CLUSTERING
EUCLIDEAN SPACE
HIERARCHICAL CLUSTERING
We first consider Euclidean space.
The algorithm:
- While the stop condition is false do:
  - Pick the best two clusters to merge.
  - Combine them into one cluster.
- End.
HIERARCHICAL CLUSTERING
Three important questions:
1. How do we represent a cluster with more than one point?
2. How do we choose which two clusters to merge?
3. When do we stop combining clusters?
HIERARCHICAL CLUSTERING
 Since we assume Euclidean space, we represent a cluster by its centroid, the average of the points in the cluster. Of course, in a cluster with one point, that point is the centroid.
 Merging rule: merge the two clusters with the shortest Euclidean distance between their centroids.
 Stopping rules: we may know in advance how many clusters there should be, and stop when this number is reached. Alternatively, stop merging when the minimum distance between any two clusters is greater than some threshold.
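A minimal Python sketch of this procedure (the sample points and the target number of clusters are illustrative assumptions, not part of the original slides):

import math

def centroid(cluster):
    # Component-wise average of the points in the cluster.
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def agglomerative(points, k):
    # Start with each point as its own cluster.
    clusters = [[p] for p in points]
    # Stop condition: the desired number of clusters has been reached.
    while len(clusters) > k:
        # Pick the best two clusters to merge: the pair whose centroids
        # are at the shortest Euclidean distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: math.dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        # Combine them into one cluster.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

print(agglomerative([(1, 1), (2, 1), (10, 5), (11, 4), (12, 6)], 2))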
HIERARCHICAL CLUSTERING - CLUSTERING ILLUSTRATION
(Figure: an example set of points in the plane and the successive cluster merges performed on them.)
HIERARCHICAL CLUSTERING- TREE
REPRESENTATION

 The tree represents the way in which all the points were combined.
 This may help in drawing conclusions about the data, including how many clusters there should be.
HIERARCHICAL CLUSTERING – CONTROLLING
CLUSTERING

Alternative rules for controlling hierarchical clustering:
 Take the distance between two clusters to be the minimum of the distances between any two points, one chosen from each cluster. For example, in phase 2 of the illustration we would next combine (10,5) with the two-point cluster.
 The radius of a cluster is the maximum distance between any point and the centroid. Combine the two clusters whose resulting cluster has the lowest radius. One may also use the average, or the sum of squares, of the distances from the centroid.
HIERARCHICAL CLUSTERING – CONTROLLING
CLUSTERING
Continuation:
 The diameter of a cluster is the maximum distance between any two points of the cluster. We merge the two clusters whose resulting cluster has the lowest diameter.
For example, the centroid of the cluster formed in step 3 of the illustration is (11,4); its radius is then the largest distance from (11,4) to any of its points, and its diameter is the largest distance between any two of its points.
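A small Python sketch of these two measures; the cluster points here are hypothetical, chosen only so that their centroid comes out to (11,4) as in the slide:

import math

# Hypothetical cluster whose centroid is (11, 4).
cluster = [(10, 5), (11, 4), (12, 3)]
c = tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(2))

# Radius: maximum distance from any point to the centroid.
radius = max(math.dist(p, c) for p in cluster)
# Diameter: maximum distance between any two points of the cluster.
diameter = max(math.dist(p, q) for p in cluster for q in cluster)
print(c, radius, diameter)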


HIERARCHICAL CLUSTERING – STOPPING
RULES
Alternative stopping rules:
 Stop if the diameter of the cluster that results from the best merger exceeds some threshold.
 Stop if the density of the cluster that results from the best merger is lower than some threshold. The density may be defined as the number of cluster points per unit volume of the cluster, where the volume is some power of the radius or diameter.
 Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster. For example, if we track the average diameter of all clusters, we will see a sudden jump in that value when a bad merge occurs.
HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
 Main problem: we use distance measures such as those mentioned at the beginning, so we cannot base distances on the locations of points. The problem arises when we need to represent a cluster, because we cannot replace a collection of points by their centroid.
(Figure: points in a Euclidean space versus strings in a space with edit distance.)
HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Example:
Suppose we use edit distance between strings. There is no string that represents the average of a set of strings, so we cannot form a centroid.

Solution:
We pick one of the points in the cluster itself to represent the cluster. This point should be as close as possible to all the points in the cluster, so it represents some kind of "center".
We call this representative point the clustroid.
HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Selecting the clustroid.
There are a few ways of selecting the clustroid point:
Select as clustroid the point that minimize:
1. The sum of the distances to the other points in the cluster.

2. The maximum distance to another point in the cluster.

3. The sum of the squares of the distances to the other points in the
cluster.
HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Example:
 Using edit distance.
 Cluster points: abcd, aecdb, abecb, ecdab.
Their pairwise distances, and the three clustroid criteria applied to each of the four points, can be computed as in the sketch below.
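A Python sketch of this computation, assuming edit distance with insertions and deletions only (|s| + |t| - 2·LCS(s, t)); the distance values and criteria are produced by the code rather than taken from the original table:

from functools import lru_cache

def edit_distance(s, t):
    # Insert/delete-only edit distance: |s| + |t| - 2 * LCS(s, t).
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == len(s) or j == len(t):
            return 0
        if s[i] == t[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return len(s) + len(t) - 2 * lcs(0, 0)

points = ["abcd", "aecdb", "abecb", "ecdab"]
for p in points:
    others = [edit_distance(p, q) for q in points if q != p]
    # The three clustroid criteria: sum, maximum, and sum of squares
    # of the distances from p to the other points.
    print(p, sum(others), max(others), sum(x * x for x in others))

Under this assumption every criterion is minimized by "aecdb", matching the result on the next slide.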


HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Results:
For every criterion, "aecdb" is selected as the clustroid.
Measuring distance between clusters:
Using the clustroid instead of the centroid, we can apply all the options used for the Euclidean-space measures. These include:
 The minimum distance between any pair of points, one from each cluster.
 The average distance between all pairs of points, one from each cluster.
 Using the radius or diameter (with the same definitions).

HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Stopping criterion:
 The stopping criteria described earlier do not use the centroid directly, except for the radius, and the radius remains valid in non-Euclidean spaces (using the clustroid).
 So all of those criteria may be used for non-Euclidean spaces as well.
BFR ALGORITHM
BFR ALGORITHM
 BFR (Bradley-Fayyad-Reina) is a variant of k-means, designed to handle very large (disk-resident) data sets.
 It assumes that clusters are normally distributed around a centroid in a Euclidean space.
 The standard deviation in different dimensions may vary.
 For example, if d = 2, we get an ellipse whose axes are aligned with the coordinate axes.
BFR ALGORITHM

 Points are read one main-memory-full at a time.
 Most points from previous memory loads are summarized by simple statistics.
 To begin, from the initial load we select the initial k centroids by some sensible approach.
BFR ALGORITHM

Initialization (similar to k-means):
 Take a small random sample and cluster it optimally.
 Or select points that are far from one another, as in the k-means algorithm (see the sketch below).
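A minimal sketch of the second option, picking far-apart points; the function name, sample data, and random seed are illustrative assumptions:

import math
import random

def far_apart_init(points, k, seed=0):
    # Pick one point at random, then repeatedly add the point whose
    # minimum distance to the points already chosen is largest.
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

print(far_apart_init([(0, 0), (1, 0), (9, 9), (10, 10), (0, 10)], 3))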
BFR ALGORITHM
The main memory data uses three types of
objects:
 The Discard Set: points close enough to a centroid to be summarized.
 The Compression Set: groups of points that are close together but are not close to any centroid. They are summarized, but not assigned to a cluster.
 The Retained Set: isolated points.
 How these sets are summarized is explained below.
BFR ALGORITHM
(Figure: how processed points are represented in memory as the discard, compression, and retained sets.)
BFR ALGORITHM
Summarizing sets of points:
For the Discard Set, each cluster is summarized by:
 The number of points, N.
 The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension.
 The vector SUMSQ, whose i-th component is the sum of the squares of the coordinates in the i-th dimension.
BFR ALGORITHM
Why use such a representation?
 2d + 1 values represent a cluster of any size, where d is the number of dimensions.
 The average of each dimension (the centroid) can be calculated as SUM_i / N for the i-th coordinate.
 The variance of each dimension in a cluster can be calculated as SUMSQ_i / N - (SUM_i / N)^2 for the i-th coordinate.
 The standard deviation is the square root of that.
Such a representation gives us a straightforward way of recalculating the centroid and standard deviation of a cluster whenever points are added to it.
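A small Python sketch of this (N, SUM, SUMSQ) representation; the class name and example points are illustrative:

class ClusterSummary:
    # The 2d + 1 statistics: N, SUM per dimension, SUMSQ per dimension.
    def __init__(self, d):
        self.n = 0
        self.sum = [0.0] * d
        self.sumsq = [0.0] * d

    def add(self, point):
        # Adding a point only updates the statistics; the point itself is discarded.
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        # Centroid: SUM_i / N in each dimension.
        return [s / self.n for s in self.sum]

    def variance(self):
        # Variance: SUMSQ_i / N - (SUM_i / N)^2 in each dimension.
        return [q / self.n - (s / self.n) ** 2 for s, q in zip(self.sum, self.sumsq)]

cs = ClusterSummary(2)
for p in [(1.0, 2.0), (2.0, 2.0), (3.0, 5.0)]:
    cs.add(p)
print(cs.centroid(), cs.variance())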
BFR ALGORITHM

Processing a chunk of points (one memory load):
 Find those points that are "sufficiently close" to a cluster centroid. Add those points to that cluster and to the Discard Set.
 Use any main-memory clustering algorithm to cluster the remaining points together with the old Retained Set.
 Clusters go to the Compression Set.
 Outlying points go to the Retained Set.
 Adjust the statistics of the clusters to account for the new points: add the N's, SUM's and SUMSQ's.
 Consider merging mini-clusters in the Compression Set.
BFR ALGORITHM

Continuation:
 If this is the last round, merge all mini-clusters in the Compression Set, and all Retained Set points, into their nearest cluster.
 Comment: in the last round we may instead treat the Compression and Retained Sets as outliers and never cluster them at all.
BFR ALGORITHM
Assigning a new point - how close is close enough?

Mahalanobis distance: the normalized Euclidean distance from the centroid.
For a point x = (x_1, ..., x_d), a centroid c = (c_1, ..., c_d), and per-dimension standard deviations σ_1, ..., σ_d of the cluster, the Mahalanobis distance between them is
MD(x, c) = sqrt( sum over i of ((x_i - c_i) / σ_i)^2 ).
BFR ALGORITHM
(Figure: illustration of the Mahalanobis distance.)
BFR ALGORITHM
Assigning a new point using the Mahalanobis distance:
 If the clusters are normally distributed in d dimensions, then after the normalization one standard deviation corresponds to a Mahalanobis distance of sqrt(d).
 That means that roughly 70% of the points of the cluster will have M.D. < sqrt(d).
 Assigning rule: assign a new point to a cluster if its M.D. is below some threshold.
 The threshold may be 4 standard deviations.
 In a normal distribution, a distance of 3 standard deviations already includes around 99.7% of the points. Thus, with that threshold there is only a very small chance of rejecting a point that truly belongs to the cluster.
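A minimal sketch of the assignment rule, following the slide's convention that one standard deviation corresponds to a Mahalanobis distance of sqrt(d); the sample values and the default threshold of 4 standard deviations are illustrative:

import math

def mahalanobis(point, centroid, std):
    # Normalized Euclidean distance: each coordinate difference is
    # divided by the cluster's standard deviation in that dimension.
    return math.sqrt(sum(((x - c) / s) ** 2 for x, c, s in zip(point, centroid, std)))

def accept(point, centroid, std, threshold_std=4):
    # Accept the point if its M.D. is below the threshold, where one
    # "standard deviation" corresponds to sqrt(d) after normalization.
    d = len(point)
    return mahalanobis(point, centroid, std) < threshold_std * math.sqrt(d)

print(accept((2.5, 3.0), (2.0, 2.0), (0.5, 0.8)))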
BFR ALGORITHM

Merging two mini-clusters in the Compression Set:
 Calculate the variance of the combined cluster.
 We can do that easily from the N, SUM, and SUMSQ representation described before.
 Merge if the variance is below some threshold.
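A short sketch of the merge test using the (N, SUM, SUMSQ) summaries; the variance threshold and sample values are illustrative assumptions:

def merged_variance(n1, sum1, sumsq1, n2, sum2, sumsq2):
    # N, SUM and SUMSQ of the union are just the component-wise sums,
    # so the combined variance needs no access to the original points.
    n = n1 + n2
    return [(q1 + q2) / n - ((s1 + s2) / n) ** 2
            for s1, q1, s2, q2 in zip(sum1, sumsq1, sum2, sumsq2)]

def should_merge(n1, sum1, sumsq1, n2, sum2, sumsq2, threshold=1.0):
    # Merge if the combined variance stays below the threshold in every dimension.
    return all(v < threshold
               for v in merged_variance(n1, sum1, sumsq1, n2, sum2, sumsq2))

print(should_merge(3, [6.0, 6.0], [14.0, 14.0], 2, [5.0, 4.0], [13.0, 8.0]))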
CURE
(Figures: the CURE algorithm, a worked example, and the finishing step of CURE.)
CLUSTERING IN NON-EUCLIDEAN
SPACES
GRGPF (V. GANTI, R. RAMAKRISHNAN, J. GEHRKE, A. POWELL, AND J. FRENCH)
 GRGPF takes ideas from both hierarchical and point-assignment approaches.
 Like CURE, it represents clusters by sample points in main memory. However, it also tries to organize the clusters hierarchically, in a tree, so a new point can be assigned to the appropriate cluster by passing it down the tree.
 Leaves of the tree hold summaries of some clusters.
 Interior nodes hold subsets of the information describing the clusters reachable through that node.
FOR EACH CLUSTER, WE CONSIDER
1. N, the number of points in the cluster.
2. The clustroid of the cluster, which is defined specifically to be the point in the
cluster that minimizes the sum of the squares of the distances to the other points; that
is, the clustroid is the point in the cluster with the smallest ROWSUM.
3. The rowsum of the clustroid of the cluster.
4. For some chosen constant k, the k points of the cluster that are closest to the
clustroid, and their rowsums. These points are part of the representation in case the
addition of points to the cluster causes the clustroid to change. The assumption is
made that the new clustroid would be one of these k points near the old clustroid.
5. The k points of the cluster that are furthest from the clustroid and their rowsums. These
points are part of the representation so that we can consider whether two clusters are
close enough to merge. The assumption is made that if two clusters are close, then a
pair of points distant from their respective clustroids would be close.
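A minimal Python sketch of building this representation for one cluster; the distance function, the choice of k, and the sample points are illustrative assumptions:

import math

def cluster_features(points, dist, k=2):
    # ROWSUM(p): sum of the squares of the distances from p to the other points.
    rowsum = {p: sum(dist(p, q) ** 2 for q in points if q != p) for p in points}
    # The clustroid is the point with the smallest rowsum.
    clustroid = min(points, key=lambda p: rowsum[p])
    by_dist = sorted((q for q in points if q != clustroid),
                     key=lambda q: dist(clustroid, q))
    return {
        "N": len(points),
        "clustroid": clustroid,
        "rowsum": rowsum[clustroid],
        "closest": [(q, rowsum[q]) for q in by_dist[:k]],    # k points nearest the clustroid
        "furthest": [(q, rowsum[q]) for q in by_dist[-k:]],  # k points furthest from it
    }

print(cluster_features([(0, 0), (1, 0), (0, 1), (5, 5), (0, 2)], math.dist))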
THREE STEPS
 Initializing the cluster tree.
 Adding points to the clusters.
 Splitting and merging clusters.
CLUSTERING
HIGH DIMENSIONAL DATA
 Clustering high-dimensional data
 Many applications: text documents, DNA micro-array data
 Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measure becomes meaningless—due to equi-distance
 Clusters may exist only in some subspaces
 Methods
 Subspace-clustering: Search for clusters existing in subspaces of the given
high dimensional data space
 CLIQUE, ProClus, and bi-clustering approaches
 Feature transformation: only effective if most dimensions are relevant
 PCA & SVD useful only when features are highly
correlated/redundant
 Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
TRADITIONAL DISTANCE MEASURES MAY NOT
BE EFFECTIVE ON HIGH-D DATA
 A traditional distance measure can be dominated by noise in many dimensions.
 Example: which pair of customers is more similar?
(Table: attribute values for customers Ada, Bob, and Cathy.)
 By Euclidean distance over all attributes we get a misleading answer, even though Ada and Cathy look more similar.
 Clustering should not only consider dimensions but also attributes (features)
 Feature transformation: effective if most dimensions are relevant (PCA
& SVD useful when features are highly correlated/redundant)
 Feature selection: useful to find a subspace where the data have nice
clusters
THE CURSE OF DIMENSIONALITY
(GRAPHS ADAPTED FROM PARSONS ET AL. KDD EXPLORATIONS
2004)

 Data in only one dimension is relatively packed.
 Adding a dimension "stretches" the points across that dimension, making them further apart.
 Adding more dimensions makes the points even further apart: high-dimensional data is extremely sparse.
 Distance measures become meaningless due to equi-distance.
WHY SUBSPACE CLUSTERING?
(ADAPTED FROM PARSONS ET AL. SIGKDD EXPLORATIONS
2004)

 Clusters may exist only in some subspaces


 Subspace-clustering: find clusters in all the subspaces

CLIQUE (CLUSTERING IN QUEST)
 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 Automatically identifying subspaces of a high dimensional data space that allow
better clustering than original space
 CLIQUE can be considered as both density-based and grid-based
 It partitions each dimension into the same number of equal-length intervals
 It partitions an m-dimensional data space into non-overlapping rectangular units
 A unit is dense if the fraction of total data points contained in the unit exceeds the
input model parameter
 A cluster is a maximal set of connected dense units within a subspace
CLIQUE: THE MAJOR STEPS
 Partition the data space and find the number of points that lie inside each cell of the partition.
 Identify the subspaces that contain clusters, using the Apriori principle.
 Identify clusters:
 Determine dense units in all subspaces of interest.
 Determine connected dense units in all subspaces of interest.
 Generate a minimal description for the clusters:
 Determine the maximal regions that cover a cluster of connected dense units, for each cluster.
 Determine a minimal cover for each cluster.
(Figure: grid partitions of the (age, salary) and (age, vacation) subspaces, with salary in units of 10,000, vacation in weeks, age from 20 to 60 on the horizontal axis, and a density threshold of 3; a third panel combines age, salary, and vacation.)
DENSE UNIT-BASED METHOD FOR SUBSPACE CLUSTERING
Density:
Dense unit: a unit is dense if the fraction of the total data points contained in it is at least a threshold T.
(Figure: a grid over the Age and Income dimensions; with T = 20%, three of the units are dense.)
DENSE UNIT-BASED METHOD FOR SUBSPACE CLUSTERING
Cluster: a maximal set of connected dense units in k dimensions.
(Figure: with T = 20%, two of the dense units are connected and form one cluster; the remaining dense unit forms another cluster.)
The first problem is to find which subspaces contain dense units. The second problem is to find the clusters in each subspace that contains dense units.
DENSE UNIT-BASED METHOD FOR SUBSPACE CLUSTERING
 Step 1: Identify the subspaces that contain dense units.
 Step 2: Identify the clusters in each subspace that contains dense units.
STEP 1
Suppose we want to find all dense units (e.g., dense units with density >= 20%).
 Property: if a set S of points is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-dimensional projection of the space.
STEP 1
(Figure: the Age-Income grid with T = 20%, highlighting which units are dense.)
STEP 1
 We can make use of the Apriori approach to solve the problem.
STEP 1
The Age dimension is partitioned into intervals A1-A6 and the Income dimension into intervals I1-I4, giving a grid of units.
 With respect to dimension Age, A3 and A4 are dense units.
 With respect to dimension Income, I1, I2 and I3 are dense units.
APRIORI
Suppose we want to find all dense units (e.g., dense units with density >= 20%). The dense units can be found level by level, as in the Apriori algorithm for frequent itemsets:
 L1: the 1-dimensional dense units (A3 and A4 with respect to Age; I1, I2 and I3 with respect to Income).
 Candidate generation (a join step followed by a prune step) combines the units in L1 to form the 2-dimensional candidates C2.
 A counting step over the data keeps only the candidates that are actually dense, giving L2, the 2-dimensional dense units.
 Repeating candidate generation and counting on L2 produces C3 and then L3, the 3-dimensional dense units, and so on.
A sketch of this level-wise search is given below.
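A minimal Python sketch of the level-wise (Apriori-style) search for dense units; the grid encoding, the simplified candidate generation, and the sample data are illustrative assumptions, not the original CLIQUE implementation:

from collections import Counter
from itertools import combinations

def dense_units(points, bins, threshold):
    # Level-wise search for dense units; a unit is a frozenset of
    # (dimension, interval index) pairs over distinct dimensions.
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]

    def interval(p, i):
        # Index of the equal-length interval of point p along dimension i.
        width = ((hi[i] - lo[i]) / bins) or 1.0
        return min(int((p[i] - lo[i]) / width), bins - 1)

    min_count = threshold * len(points)
    # L1: the 1-dimensional dense units.
    counts = Counter(frozenset([(i, interval(p, i))]) for p in points for i in range(d))
    level = {u for u, c in counts.items() if c >= min_count}
    all_dense, k = set(level), 2
    while level:
        # Candidate generation (join + prune): unions of two units from the
        # previous level that cover k distinct dimensions and whose
        # (k-1)-dimensional sub-units are all dense.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k and len({dim for dim, _ in a | b}) == k
                      and all(frozenset(s) in level for s in combinations(a | b, k - 1))}
        # Counting step: keep only the candidates that are actually dense.
        counts = Counter()
        for p in points:
            for cand in candidates:
                if all(interval(p, dim) == iv for dim, iv in cand):
                    counts[cand] += 1
        level = {u for u, c in counts.items() if c >= min_count}
        all_dense |= level
        k += 1
    return all_dense

pts = [(20, 55), (22, 52), (24, 58), (40, 30), (42, 33), (60, 90)]
print(dense_units(pts, bins=4, threshold=0.3))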

DENSE UNIT-BASED METHOD FOR SUBSPACE CLUSTERING
 Step 1: Identify the subspaces that contain dense units.
 Step 2: Identify the clusters in each subspace that contains dense units.
STEP 2
Suppose we have found all dense units (e.g., dense units with density >= 20%); recall that with respect to Age, A3 and A4 are dense units, and with respect to Income, I1, I2 and I3 are dense units. In the Age-Income subspace the dense 2-D units are (A3, I3), (A4, I3) and (A4, I1), which form two clusters:
 Cluster 1: the connected dense units (A3 or A4) with I3. E.g., if A3 = 16-20, A4 = 21-25 and I3 = 20k-25k, then Cluster 1 is Age = 16-25 and Income = 20k-25k.
 Cluster 2: the dense unit A4 with I1. E.g., if A4 = 21-25 and I1 = 10k-15k, then Cluster 2 is Age = 21-25 and Income = 10k-15k.
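A minimal Python sketch of Step 2, treating each dense unit as a mapping from dimension to interval index and joining units that are adjacent in exactly one dimension; this encoding is an illustrative assumption:

def clusters_from_dense_units(units):
    # Each dense unit is a dict mapping dimension name -> interval index.
    # Two units are connected if they agree on all dimensions except one,
    # where their interval indices differ by exactly 1.
    def connected(a, b):
        if a.keys() != b.keys():
            return False
        diffs = [dim for dim in a if a[dim] != b[dim]]
        return len(diffs) == 1 and abs(a[diffs[0]] - b[diffs[0]]) == 1

    clusters, seen = [], set()
    for i in range(len(units)):
        if i in seen:
            continue
        # Grow a maximal connected component starting from unit i.
        component, frontier = {i}, [i]
        while frontier:
            cur = frontier.pop()
            for j in range(len(units)):
                if j not in component and connected(units[cur], units[j]):
                    component.add(j)
                    frontier.append(j)
        seen |= component
        clusters.append([units[j] for j in sorted(component)])
    return clusters

# The slide's example: dense 2-D units (A3, I3), (A4, I3) and (A4, I1).
dense = [{"Age": 3, "Income": 3}, {"Age": 4, "Income": 3}, {"Age": 4, "Income": 1}]
print(clusters_from_dense_units(dense))

Under this encoding the result is the slide's two clusters: {(A3, I3), (A4, I3)} and {(A4, I1)}.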
