
UNIT-4

CLUSTERING TECHNIQUES
CLUSTER
 A cluster is a group of objects that belong to the same class.
 In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
CLUSTERING
What is Clustering?
 Clustering is the process of grouping abstract objects into classes of similar objects.
CLUSTERING
 The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
 Clustering is also called data segmentation in some applications, because clustering partitions large data sets into groups according to their similarity.
 Clustering can also be used for outlier detection.
EXAMPLE: APPLICATION OF A CLUSTERING ALGORITHM
 Suppose we are a market manager, and we have a new, tempting product to sell.
 We are sure that the product would bring enormous profit, as long as it is sold to the right people.
 So, how can we tell who is best suited for the product from our company's huge customer base?
 If the data from the customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.
CLUSTERING
 In machine learning, clustering is an example of unsupervised learning.
o Unsupervised learning does not rely on predefined classes and class-labeled training examples.
o For this reason, clustering is a form of learning by observation, rather than learning by examples.
POINTS TO REMEMBER
 A cluster of data objects can be treated as one group.
 While doing cluster analysis, we first partition the set of
data into groups based on data similarity and then assign
the labels to the groups.
 The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
A CATEGORIZATION OF MAJOR
CLUSTERING METHODS

 Partitioning Method
 Hierarchical Method

 Density-based Method

 Grid-Based Method

 Model-Based Method

 Constraint-based Method
HIERARCHICAL METHODS
 A hierarchical method creates a hierarchical decomposition of the given set of data objects.
 A hierarchical clustering method works by grouping data objects into a tree of clusters.
 We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
ALGORITHMIC APPROACHES
There are two main approaches:
 Hierarchical algorithms:
 Agglomerative (bottom-up): Start with each point as a cluster. Clusters
are combined based on their “closeness”, using some definition of
“close” (will be discussed later).
 Divisive (top-down): Start with one cluster including all points and
recursively split each cluster based on some criterion.
Will not be discussed in this presentation.
 Point assignment algorithms:
 Points are considered in some order, and each one is assigned to the cluster into which it best fits.
ALGORITHMIC APPROACHES

Other distinctions between clustering algorithms:
 Whether the algorithm assumes a Euclidean space, or works for an arbitrary distance measure (non-Euclidean space).
 Whether the algorithm assumes that the data is small enough to fit in main memory, or whether the data must primarily reside in secondary memory.
AGGLOMERATIVE HIERARCHICAL
CLUSTERING
 The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group.
 It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition is satisfied.
DIVISIVE HIERARCHICAL
CLUSTERING
 The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster.
 In each successive iteration, it subdivides a cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied (for example, each cluster is within a certain distance threshold).
HIERARCHICAL
CLUSTERING
EUCLIDEAN SPACE
HIERARCHICAL CLUSTERING
We first consider Euclidean space.
The algorithm:
- While the stop condition is false do:
  - Pick the best two clusters to merge.
  - Combine them into one cluster.
- End.
HIERARCHICAL CLUSTERING
Three important questions:
1. How do we represent a cluster with more than one point?
2. How do we choose which two clusters to merge?
3. When do we stop combining clusters?
HIERARCHICAL CLUSTERING
 Since we assume Euclidean space, we represent a cluster by its centroid, the average of the points in the cluster. Of course, in a cluster with one point, that point is the centroid.
 Merging rule: merge the two clusters with the shortest Euclidean distance between their centroids.
 Stopping rules: we may know in advance how many clusters there should be, and stop when this number is reached. Alternatively, stop merging when the minimum distance between any two clusters is greater than some threshold.
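A minimal Python sketch of this procedure (the sample points and the target number of clusters are illustrative assumptions, not part of the original slides):

import math

def centroid(cluster):
    # Component-wise average of the points in the cluster.
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def agglomerative(points, k):
    # Start with each point as its own cluster.
    clusters = [[p] for p in points]
    # Stop condition: the desired number of clusters has been reached.
    while len(clusters) > k:
        # Pick the best two clusters to merge: the pair whose centroids
        # are at the shortest Euclidean distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: math.dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        # Combine them into one cluster.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

print(agglomerative([(1, 1), (2, 1), (10, 5), (11, 4), (12, 6)], 2))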
HIERARCHICAL CLUSTERING - CLUSTERING ILLUSTRATION
(Figure: an example set of points in the plane and the successive cluster merges performed on them.)
HIERARCHICAL CLUSTERING- TREE
REPRESENTATION

 The tree represents the way in which all the points were combined.
 This may help in drawing conclusions about the data, including how many clusters there should be.
HIERARCHICAL CLUSTERING – CONTROLLING
CLUSTERING

Alternative rules for controlling hierarchical clustering:
 Take the distance between two clusters to be the minimum of the distances between any two points, one chosen from each cluster. For example, in phase 2 of the illustration we would next combine (10,5) with the two-point cluster.
 The radius of a cluster is the maximum distance between any point and the centroid. Combine the two clusters whose resulting cluster has the lowest radius. One may also use the average, or the sum of squares, of the distances from the centroid.
HIERARCHICAL CLUSTERING – CONTROLLING
CLUSTERING
Continuation:
 The diameter of a cluster is the maximum distance between any two points of the cluster. We merge the two clusters whose resulting cluster has the lowest diameter.
For example, the centroid of the cluster formed in step 3 of the illustration is (11,4); its radius is then the largest distance from (11,4) to any of its points, and its diameter is the largest distance between any two of its points.
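A small Python sketch of these two measures; the cluster points here are hypothetical, chosen only so that their centroid comes out to (11,4) as in the slide:

import math

# Hypothetical cluster whose centroid is (11, 4).
cluster = [(10, 5), (11, 4), (12, 3)]
c = tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(2))

# Radius: maximum distance from any point to the centroid.
radius = max(math.dist(p, c) for p in cluster)
# Diameter: maximum distance between any two points of the cluster.
diameter = max(math.dist(p, q) for p in cluster for q in cluster)
print(c, radius, diameter)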


HIERARCHICAL CLUSTERING – STOPPING
RULES
Alternative stopping rules:
 Stop if the diameter of the cluster that results from the best merger exceeds some threshold.
 Stop if the density of the cluster that results from the best merger is lower than some threshold. The density may be defined as the number of cluster points per unit volume of the cluster, where the volume is some power of the radius or diameter.
 Stop when there is evidence that the next pair of clusters to be combined yields a bad cluster. For example, if we track the average diameter of all clusters, we will see a sudden jump in that value when a bad merge occurs.
HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
 Main problem: we use distance measures such as those mentioned at the beginning, so we cannot base distances on the locations of points. The problem arises when we need to represent a cluster, because we cannot replace a collection of points by their centroid.
(Figure: points in a Euclidean space versus strings in a space with edit distance.)
HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Example:
Suppose we use edit distance between strings. There is no string that represents the average of a set of strings, so we cannot form a centroid.

Solution:
We pick one of the points in the cluster itself to represent the cluster. This point should be as close as possible to all the points in the cluster, so it represents some kind of "center".
We call this representative point the clustroid.
HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Selecting the clustroid.
There are a few ways of selecting the clustroid point:
Select as clustroid the point that minimize:
1. The sum of the distances to the other points in the cluster.

2. The maximum distance to another point in the cluster.

3. The sum of the squares of the distances to the other points in the
cluster.
HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Example:
 Using edit distance.
 Cluster points: abcd, aecdb, abecb, ecdab.
Their pairwise distances, and the three clustroid criteria applied to each of the four points, can be computed as in the sketch below.
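A Python sketch of this computation, assuming edit distance with insertions and deletions only (|s| + |t| - 2·LCS(s, t)); the distance values and criteria are produced by the code rather than taken from the original table:

from functools import lru_cache

def edit_distance(s, t):
    # Insert/delete-only edit distance: |s| + |t| - 2 * LCS(s, t).
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == len(s) or j == len(t):
            return 0
        if s[i] == t[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return len(s) + len(t) - 2 * lcs(0, 0)

points = ["abcd", "aecdb", "abecb", "ecdab"]
for p in points:
    others = [edit_distance(p, q) for q in points if q != p]
    # The three clustroid criteria: sum, maximum, and sum of squares
    # of the distances from p to the other points.
    print(p, sum(others), max(others), sum(x * x for x in others))

Under this assumption every criterion is minimized by "aecdb", matching the result on the next slide.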


HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Results:
For every criterion, "aecdb" is selected as the clustroid.
Measuring distance between clusters:
Using the clustroid instead of the centroid, we can apply all the options used for the Euclidean-space measures. These include:
 The minimum distance between any pair of points, one from each cluster.
 The average distance between all pairs of points, one from each cluster.
 Using the radius or diameter (with the same definitions).

HIERARCHICAL CLUSTERING IN NON-EUCLIDEAN
SPACES
Stopping criterion:
 The stopping criteria described earlier do not use the centroid directly, except for the radius, and the radius remains valid in non-Euclidean spaces (using the clustroid).
 So all of those criteria may be used for non-Euclidean spaces as well.
BFR ALGORITHM
BFR ALGORITHM
 BFR (Bradley-Fayyad-Reina) is a variant of k-means, designed to handle very large (disk-resident) data sets.
 It assumes that clusters are normally distributed around a centroid in a Euclidean space.
 The standard deviation in different dimensions may vary.
 For example, if d = 2, we get an ellipse whose axes are aligned with the coordinate axes.
BFR ALGORITHM

 Points are read one main-memory-full at a time.
 Most points from previous memory loads are summarized by simple statistics.
 To begin, from the initial load we select the initial k centroids by some sensible approach.
BFR ALGORITHM

Initialization (similar to k-means):
 Take a small random sample and cluster it optimally.
 Or select points that are far from one another, as in the k-means algorithm (see the sketch below).
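A minimal sketch of the second option, picking far-apart points; the function name, sample data, and random seed are illustrative assumptions:

import math
import random

def far_apart_init(points, k, seed=0):
    # Pick one point at random, then repeatedly add the point whose
    # minimum distance to the points already chosen is largest.
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

print(far_apart_init([(0, 0), (1, 0), (9, 9), (10, 10), (0, 10)], 3))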
BFR ALGORITHM
The main memory data uses three types of
objects:
 The Discard Set: points close enough to a centroid to be summarized.
 The Compression Set: groups of points that are close together but are not close to any centroid. They are summarized, but not assigned to a cluster.
 The Retained Set: isolated points.
 How these sets are summarized is explained below.
BFR ALGORITHM
(Figure: how processed points are represented in memory as the discard, compression, and retained sets.)
BFR ALGORITHM
Summarizing sets of points:
For the Discard Set, each cluster is summarized by:
 The number of points, N.
 The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension.
 The vector SUMSQ, whose i-th component is the sum of the squares of the coordinates in the i-th dimension.
BFR ALGORITHM
Why use such a representation?
 2d + 1 values represent a cluster of any size, where d is the number of dimensions.
 The average of each dimension (the centroid) can be calculated as SUM_i / N for the i-th coordinate.
 The variance of each dimension in a cluster can be calculated as SUMSQ_i / N - (SUM_i / N)^2 for the i-th coordinate.
 The standard deviation is the square root of that.
Such a representation gives us a straightforward way of recalculating the centroid and standard deviation of a cluster whenever points are added to it.
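A small Python sketch of this (N, SUM, SUMSQ) representation; the class name and example points are illustrative:

class ClusterSummary:
    # The 2d + 1 statistics: N, SUM per dimension, SUMSQ per dimension.
    def __init__(self, d):
        self.n = 0
        self.sum = [0.0] * d
        self.sumsq = [0.0] * d

    def add(self, point):
        # Adding a point only updates the statistics; the point itself is discarded.
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        # Centroid: SUM_i / N in each dimension.
        return [s / self.n for s in self.sum]

    def variance(self):
        # Variance: SUMSQ_i / N - (SUM_i / N)^2 in each dimension.
        return [q / self.n - (s / self.n) ** 2 for s, q in zip(self.sum, self.sumsq)]

cs = ClusterSummary(2)
for p in [(1.0, 2.0), (2.0, 2.0), (3.0, 5.0)]:
    cs.add(p)
print(cs.centroid(), cs.variance())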
BFR ALGORITHM

Processing a chunk of points (one memory load):
 Find those points that are "sufficiently close" to a cluster centroid. Add those points to that cluster and to the Discard Set.
 Use any main-memory clustering algorithm to cluster the remaining points together with the old Retained Set.
 Clusters go to the Compression Set.
 Outlying points go to the Retained Set.
 Adjust the statistics of the clusters to account for the new points: add the N's, SUM's and SUMSQ's.
 Consider merging mini-clusters in the Compression Set.
BFR ALGORITHM

Continuation:
 If this is the last round, merge all mini-clusters in the Compression Set, and all Retained Set points, into their nearest cluster.
 Comment: in the last round we may instead treat the Compression and Retained Sets as outliers and never cluster them at all.
BFR ALGORITHM
Assigning a new point - how close is close enough?

Mahalanobis distance: the normalized Euclidean distance from the centroid.
For a point x = (x_1, ..., x_d), a centroid c = (c_1, ..., c_d), and per-dimension standard deviations σ_1, ..., σ_d of the cluster, the Mahalanobis distance between them is
MD(x, c) = sqrt( sum over i of ((x_i - c_i) / σ_i)^2 ).
BFR ALGORITHM
(Figure: illustration of the Mahalanobis distance.)
BFR ALGORITHM
Assigning a new point using the Mahalanobis distance:
 If the clusters are normally distributed in d dimensions, then after the normalization one standard deviation corresponds to a Mahalanobis distance of sqrt(d).
 That means that roughly 70% of the points of the cluster will have M.D. < sqrt(d).
 Assigning rule: assign a new point to a cluster if its M.D. is below some threshold.
 The threshold may be 4 standard deviations.
 In a normal distribution, a distance of 3 standard deviations already includes around 99.7% of the points. Thus, with that threshold there is only a very small chance of rejecting a point that truly belongs to the cluster.
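A minimal sketch of the assignment rule, following the slide's convention that one standard deviation corresponds to a Mahalanobis distance of sqrt(d); the sample values and the default threshold of 4 standard deviations are illustrative:

import math

def mahalanobis(point, centroid, std):
    # Normalized Euclidean distance: each coordinate difference is
    # divided by the cluster's standard deviation in that dimension.
    return math.sqrt(sum(((x - c) / s) ** 2 for x, c, s in zip(point, centroid, std)))

def accept(point, centroid, std, threshold_std=4):
    # Accept the point if its M.D. is below the threshold, where one
    # "standard deviation" corresponds to sqrt(d) after normalization.
    d = len(point)
    return mahalanobis(point, centroid, std) < threshold_std * math.sqrt(d)

print(accept((2.5, 3.0), (2.0, 2.0), (0.5, 0.8)))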
BFR ALGORITHM

Merging two mini-clusters in the Compression Set:
 Calculate the variance of the combined cluster.
 We can do that easily from the N, SUM, and SUMSQ representation described before.
 Merge if the variance is below some threshold.
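A short sketch of the merge test using the (N, SUM, SUMSQ) summaries; the variance threshold and sample values are illustrative assumptions:

def merged_variance(n1, sum1, sumsq1, n2, sum2, sumsq2):
    # N, SUM and SUMSQ of the union are just the component-wise sums,
    # so the combined variance needs no access to the original points.
    n = n1 + n2
    return [(q1 + q2) / n - ((s1 + s2) / n) ** 2
            for s1, q1, s2, q2 in zip(sum1, sumsq1, sum2, sumsq2)]

def should_merge(n1, sum1, sumsq1, n2, sum2, sumsq2, threshold=1.0):
    # Merge if the combined variance stays below the threshold in every dimension.
    return all(v < threshold
               for v in merged_variance(n1, sum1, sumsq1, n2, sum2, sumsq2))

print(should_merge(3, [6.0, 6.0], [14.0, 14.0], 2, [5.0, 4.0], [13.0, 8.0]))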
CURE
(Figures: the CURE algorithm, a worked example, and the finishing step of CURE.)
CLUSTERING IN NON-EUCLIDEAN
SPACES
GRGPF (V. GANTI, R. RAMAKRISHNAN, J. GEHRKE, A. POWELL, AND J. FRENCH)
 GRGPF takes ideas from both hierarchical and point-assignment approaches.
 Like CURE, it represents clusters by sample points in main memory. However, it also tries to organize the clusters hierarchically, in a tree, so a new point can be assigned to the appropriate cluster by passing it down the tree.
 Leaves of the tree hold summaries of some clusters.
 Interior nodes hold subsets of the information describing the clusters reachable through that node.
FOR EACH CLUSTER, WE CONSIDER
1. N, the number of points in the cluster.
2. The clustroid of the cluster, which is defined specifically to be the point in the
cluster that minimizes the sum of the squares of the distances to the other points; that
is, the clustroid is the point in the cluster with the smallest ROWSUM.
3. The rowsum of the clustroid of the cluster.
4. For some chosen constant k, the k points of the cluster that are closest to the
clustroid, and their rowsums. These points are part of the representation in case the
addition of points to the cluster causes the clustroid to change. The assumption is
made that the new clustroid would be one of these k points near the old clustroid.
5. The k points of the cluster that are furthest from the clustroid and their rowsums. These
points are part of the representation so that we can consider whether two clusters are
close enough to merge. The assumption is made that if two clusters are close, then a
pair of points distant from their respective clustroids would be close.
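A minimal Python sketch of building this representation for one cluster; the distance function, the choice of k, and the sample points are illustrative assumptions:

import math

def cluster_features(points, dist, k=2):
    # ROWSUM(p): sum of the squares of the distances from p to the other points.
    rowsum = {p: sum(dist(p, q) ** 2 for q in points if q != p) for p in points}
    # The clustroid is the point with the smallest rowsum.
    clustroid = min(points, key=lambda p: rowsum[p])
    by_dist = sorted((q for q in points if q != clustroid),
                     key=lambda q: dist(clustroid, q))
    return {
        "N": len(points),
        "clustroid": clustroid,
        "rowsum": rowsum[clustroid],
        "closest": [(q, rowsum[q]) for q in by_dist[:k]],    # k points nearest the clustroid
        "furthest": [(q, rowsum[q]) for q in by_dist[-k:]],  # k points furthest from it
    }

print(cluster_features([(0, 0), (1, 0), (0, 1), (5, 5), (0, 2)], math.dist))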
THREE STEPS
 Initializing the cluster tree.
 Adding points to the clusters.
 Splitting and merging clusters.
CLUSTERING
HIGH DIMENSIONAL DATA
 Clustering high-dimensional data
 Many applications: text documents, DNA micro-array data
 Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measure becomes meaningless—due to equi-distance
 Clusters may exist only in some subspaces
 Methods
 Subspace-clustering: Search for clusters existing in subspaces of the given
high dimensional data space
 CLIQUE, ProClus, and bi-clustering approaches
 Feature transformation: only effective if most dimensions are relevant
 PCA & SVD useful only when features are highly
correlated/redundant
 Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
TRADITIONAL DISTANCE MEASURES MAY NOT
BE EFFECTIVE ON HIGH-D DATA
 A traditional distance measure can be dominated by noise in many dimensions.
 Example: which pair of customers is more similar?
(Table: attribute values for customers Ada, Bob, and Cathy.)
 By Euclidean distance over all attributes we get a misleading answer, even though Ada and Cathy look more similar.
 Clustering should not only consider dimensions but also attributes (features)
 Feature transformation: effective if most dimensions are relevant (PCA
& SVD useful when features are highly correlated/redundant)
 Feature selection: useful to find a subspace where the data have nice
clusters
THE CURSE OF DIMENSIONALITY
(GRAPHS ADAPTED FROM PARSONS ET AL. KDD EXPLORATIONS
2004)

 Data in only one dimension is relatively packed.
 Adding a dimension "stretches" the points across that dimension, making them further apart.
 Adding more dimensions makes the points even further apart: high-dimensional data is extremely sparse.
 Distance measures become meaningless due to equi-distance.
WHY SUBSPACE CLUSTERING?
(ADAPTED FROM PARSONS ET AL. SIGKDD EXPLORATIONS
2004)

 Clusters may exist only in some subspaces


 Subspace-clustering: find clusters in all the subspaces

CLIQUE (CLUSTERING IN QUEST)
 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 Automatically identifying subspaces of a high dimensional data space that allow
better clustering than original space
 CLIQUE can be considered as both density-based and grid-based
 It partitions each dimension into the same number of equal-length intervals
 It partitions an m-dimensional data space into non-overlapping rectangular units
 A unit is dense if the fraction of total data points contained in the unit exceeds the
input model parameter
 A cluster is a maximal set of connected dense units within a subspace
CLIQUE: THE MAJOR STEPS
 Partition the data space and find the number of points that lie inside each cell of the partition.
 Identify the subspaces that contain clusters, using the Apriori principle.
 Identify clusters:
 Determine dense units in all subspaces of interest.
 Determine connected dense units in all subspaces of interest.
 Generate a minimal description for the clusters:
 Determine the maximal regions that cover a cluster of connected dense units, for each cluster.
 Determine a minimal cover for each cluster.
(Figure: grid partitions of the (age, salary) and (age, vacation) subspaces, with salary in units of 10,000, vacation in weeks, age from 20 to 60 on the horizontal axis, and a density threshold of 3; a third panel combines age, salary, and vacation.)
DENSE UNIT-BASED METHOD FOR SUBSPACE CLUSTERING
Density:
Dense unit: a unit is dense if the fraction of the total data points contained in it is at least a threshold T.
(Figure: a grid over the Age and Income dimensions; with T = 20%, three of the units are dense.)
DENSE UNIT-BASED METHOD FOR SUBSPACE CLUSTERING
Cluster: a maximal set of connected dense units in k dimensions.
(Figure: with T = 20%, two of the dense units are connected and form one cluster; the remaining dense unit forms another cluster.)
The first problem is to find which subspaces contain dense units. The second problem is to find the clusters in each subspace that contains dense units.
DENSE UNIT-BASED METHOD FOR SUBSPACE CLUSTERING
 Step 1: Identify the subspaces that contain dense units.
 Step 2: Identify the clusters in each subspace that contains dense units.
STEP 1
Suppose we want to find all dense units (e.g., dense units with density >= 20%).
 Property: if a set S of points is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-dimensional projection of the space.
STEP 1
(Figure: the Age-Income grid with T = 20%, highlighting which units are dense.)
STEP 1
 We can make use of the Apriori approach to solve the problem.
STEP 1
The Age dimension is partitioned into intervals A1-A6 and the Income dimension into intervals I1-I4, giving a grid of units.
 With respect to dimension Age, A3 and A4 are dense units.
 With respect to dimension Income, I1, I2 and I3 are dense units.
APRIORI
Suppose we want to find all dense units (e.g., dense units with density >= 20%). The dense units can be found level by level, as in the Apriori algorithm for frequent itemsets:
 L1: the 1-dimensional dense units (A3 and A4 with respect to Age; I1, I2 and I3 with respect to Income).
 Candidate generation (a join step followed by a prune step) combines the units in L1 to form the 2-dimensional candidates C2.
 A counting step over the data keeps only the candidates that are actually dense, giving L2, the 2-dimensional dense units.
 Repeating candidate generation and counting on L2 produces C3 and then L3, the 3-dimensional dense units, and so on.
A sketch of this level-wise search is given below.
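A minimal Python sketch of the level-wise (Apriori-style) search for dense units; the grid encoding, the simplified candidate generation, and the sample data are illustrative assumptions, not the original CLIQUE implementation:

from collections import Counter
from itertools import combinations

def dense_units(points, bins, threshold):
    # Level-wise search for dense units; a unit is a frozenset of
    # (dimension, interval index) pairs over distinct dimensions.
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]

    def interval(p, i):
        # Index of the equal-length interval of point p along dimension i.
        width = ((hi[i] - lo[i]) / bins) or 1.0
        return min(int((p[i] - lo[i]) / width), bins - 1)

    min_count = threshold * len(points)
    # L1: the 1-dimensional dense units.
    counts = Counter(frozenset([(i, interval(p, i))]) for p in points for i in range(d))
    level = {u for u, c in counts.items() if c >= min_count}
    all_dense, k = set(level), 2
    while level:
        # Candidate generation (join + prune): unions of two units from the
        # previous level that cover k distinct dimensions and whose
        # (k-1)-dimensional sub-units are all dense.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k and len({dim for dim, _ in a | b}) == k
                      and all(frozenset(s) in level for s in combinations(a | b, k - 1))}
        # Counting step: keep only the candidates that are actually dense.
        counts = Counter()
        for p in points:
            for cand in candidates:
                if all(interval(p, dim) == iv for dim, iv in cand):
                    counts[cand] += 1
        level = {u for u, c in counts.items() if c >= min_count}
        all_dense |= level
        k += 1
    return all_dense

pts = [(20, 55), (22, 52), (24, 58), (40, 30), (42, 33), (60, 90)]
print(dense_units(pts, bins=4, threshold=0.3))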

DENSE UNIT-BASED METHOD FOR SUBSPACE CLUSTERING
 Step 1: Identify the subspaces that contain dense units.
 Step 2: Identify the clusters in each subspace that contains dense units.
STEP 2
Suppose we have found all dense units (e.g., dense units with density >= 20%); recall that with respect to Age, A3 and A4 are dense units, and with respect to Income, I1, I2 and I3 are dense units. In the Age-Income subspace the dense 2-D units are (A3, I3), (A4, I3) and (A4, I1), which form two clusters:
 Cluster 1: the connected dense units (A3 or A4) with I3. E.g., if A3 = 16-20, A4 = 21-25 and I3 = 20k-25k, then Cluster 1 is Age = 16-25 and Income = 20k-25k.
 Cluster 2: the dense unit A4 with I1. E.g., if A4 = 21-25 and I1 = 10k-15k, then Cluster 2 is Age = 21-25 and Income = 10k-15k.
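A minimal Python sketch of Step 2, treating each dense unit as a mapping from dimension to interval index and joining units that are adjacent in exactly one dimension; this encoding is an illustrative assumption:

def clusters_from_dense_units(units):
    # Each dense unit is a dict mapping dimension name -> interval index.
    # Two units are connected if they agree on all dimensions except one,
    # where their interval indices differ by exactly 1.
    def connected(a, b):
        if a.keys() != b.keys():
            return False
        diffs = [dim for dim in a if a[dim] != b[dim]]
        return len(diffs) == 1 and abs(a[diffs[0]] - b[diffs[0]]) == 1

    clusters, seen = [], set()
    for i in range(len(units)):
        if i in seen:
            continue
        # Grow a maximal connected component starting from unit i.
        component, frontier = {i}, [i]
        while frontier:
            cur = frontier.pop()
            for j in range(len(units)):
                if j not in component and connected(units[cur], units[j]):
                    component.add(j)
                    frontier.append(j)
        seen |= component
        clusters.append([units[j] for j in sorted(component)])
    return clusters

# The slide's example: dense 2-D units (A3, I3), (A4, I3) and (A4, I1).
dense = [{"Age": 3, "Income": 3}, {"Age": 4, "Income": 3}, {"Age": 4, "Income": 1}]
print(clusters_from_dense_units(dense))

Under this encoding the result is the slide's two clusters: {(A3, I3), (A4, I3)} and {(A4, I1)}.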
