Understanding Cluster Analysis Techniques

Cluster analysis is a method for grouping similar data objects into clusters based on their characteristics, commonly used in various fields such as marketing and spatial data analysis. It involves different types of data and major clustering methods, including partitioning and hierarchical approaches, with the k-means algorithm being a notable example. The effectiveness of clustering depends on the quality of clusters formed, which is influenced by the chosen similarity measures and the handling of various data types.


Cluster Analysis

1. What is Cluster Analysis?


2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods



What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping
similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data
distribution
 As a preprocessing step for other algorithms
Clustering: Rich Applications and
Multidisciplinary Efforts
 Pattern Recognition
 Spatial Data Analysis
    Create thematic maps in GIS by clustering feature spaces
    Detect spatial clusters or for other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 WWW
    Document classification
    Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering
Applications
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
 Land use: Identification of areas of similar land use in an
earth observation database
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 City-planning: Identifying groups of houses according to their
house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters
should be clustered along continental faults
Quality: What Is Good
Clustering?
 A good clustering method will produce high quality
clusters with
    high intra-class similarity
    low inter-class similarity
 The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured
by its ability to discover some or all of the hidden
patterns
Measure the Quality of
Clustering
 Dissimilarity/Similarity metric: Similarity is expressed
in terms of a distance function, typically metric: d(i, j)
 There is a separate “quality” function that measures
the “goodness” of a cluster.
 The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables.
 Weights should be associated with different variables
based on applications and data semantics.
 It is hard to define “similar enough” or “good
enough”
 the answer is typically highly subjective.
Requirements of Clustering in Data
Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods



Data Structures
 Data matrix (n objects × p variables):

   [ x_11  ...  x_1f  ...  x_1p ]
   [  ...  ...   ...  ...   ... ]
   [ x_i1  ...  x_if  ...  x_ip ]
   [  ...  ...   ...  ...   ... ]
   [ x_n1  ...  x_nf  ...  x_np ]

 Dissimilarity matrix (n × n, lower triangular):

   [   0                            ]
   [ d(2,1)    0                    ]
   [ d(3,1)  d(3,2)    0            ]
   [   :       :       :            ]
   [ d(n,1)  d(n,2)   ...  ...   0  ]


Type of data in cluster analysis
 Interval-scaled variables
 e.g., salary, temperature
 Binary variables
 e.g., gender (M/F), has_cancer(T/F)
 Nominal (categorical) variables
 e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
 Ordinal variables
 e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
 Ratio-scaled variables
 population growth (1,10,100,1000,...)
 Variables of mixed types
 multiple attributes with various types
Interval-valued variables
 Standardize data
    Calculate the mean absolute deviation:
      s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
      where m_f = (1/n) (x_1f + x_2f + ... + x_nf)
    Calculate the standardized measurement (z-score):
      z_if = (x_if − m_f) / s_f
 Using mean absolute deviation is more robust than
using standard deviation
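As a quick illustration (a minimal NumPy sketch, not part of the original slides; the data values are made up), the standardization above can be written as:

import numpy as np

def standardize(X):
    # z_if = (x_if - m_f) / s_f, where s_f is the mean absolute deviation of variable f
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                  # m_f: per-variable mean
    s = np.abs(X - m).mean(axis=0)      # s_f: mean absolute deviation
    return (X - m) / s

# hypothetical salary and temperature columns on very different scales
X = [[30000, 15.0], [45000, 22.0], [60000, 18.0]]
print(standardize(X))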
Similarity and Dissimilarity
Between Objects

 Distances are normally used to measure the
similarity or dissimilarity between two data
objects
 Some popular ones include the Minkowski distance:
   d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp)
are two p-dimensional data objects, and q is a
positive integer
 If q = 1, d is Manhattan distance:
   d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|


Similarity and Dissimilarity
Between Objects (Cont.)

 If q = 2, d is Euclidean distance:
   d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
 Properties
    d(i, j) ≥ 0
    d(i, i) = 0
    d(i, j) = d(j, i)
    d(i, j) ≤ d(i, k) + d(k, j)
 Also, one can use weighted distance, parametric
Pearson product moment correlation, or other
dissimilarity measures
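For concreteness, a small Python sketch (not from the slides) of the Minkowski family; q = 1 gives Manhattan distance and q = 2 gives Euclidean distance:

import numpy as np

def minkowski(i, j, q=2):
    # Minkowski distance between two p-dimensional objects i and j
    i, j = np.asarray(i, dtype=float), np.asarray(j, dtype=float)
    return np.sum(np.abs(i - j) ** q) ** (1.0 / q)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, q=1))   # Manhattan: 5.0
print(minkowski(x, y, q=2))   # Euclidean: ~3.61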



Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Summary



Major Clustering Approaches

 Partitioning approach:
 Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)
using some criterion
 Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON



Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods



Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D of
n objects into a set of k clusters, such that the sum of squared
distances is minimized:
   E = Σ (m = 1..k) Σ (t_mi ∈ K_m) (C_m − t_mi)²
where C_m is the center of cluster K_m
 Given k, find a partition of k clusters that optimizes the
chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by
the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented
in four steps:
    1. Partition objects into k nonempty subsets
    2. Compute seed points as the centroids of the
clusters of the current partition (the centroid
is the center, i.e., mean point, of the cluster)
    3. Assign each object to the cluster with the
nearest seed point
    4. Go back to Step 2; stop when no more new
assignments occur
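A minimal NumPy sketch of these four steps (not from the slides; it assumes Euclidean distance and, for simplicity, re-seeds any cluster that becomes empty):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = rng.integers(0, k, size=len(X))             # Step 1: random partition into k subsets
    centroids = np.empty((k, X.shape[1]))
    for _ in range(max_iter):
        for c in range(k):                               # Step 2: centroids of the current partition
            members = X[labels == c]
            centroids[c] = members.mean(axis=0) if len(members) else X[rng.integers(len(X))]
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                # Step 3: assign to the nearest seed point
        if np.array_equal(new_labels, labels):           # Step 4: stop when nothing changes
            break
        labels = new_labels
    return labels, centroids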



The K-Means Clustering Method

 Example (K = 2): arbitrarily choose K objects as the initial
cluster centers; assign each object to the most similar center;
update the cluster means; reassign objects; repeat until no
object changes cluster.
(Figure: three iterations of assignment and mean-update on a 2-D point set.)


Comments on the K-Means Method

 Strength: Relatively efficient: O(tkn), where n is # objects, k is
# clusters, and t is # iterations. Normally, k, t << n.
 Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as:
deterministic annealing and genetic algorithms
 Weakness
 Applicable only when mean is defined, then what about
categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes



Challenges
 When the size of clusters is different

Courtesy of https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
Challenges
 When the densities of original points are
different

Courtesy of https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
Variations of the K-Means Method

 A few variants of the k-means exist, which differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes (Huang’98)
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method



K-Means++ Clustering
 Aids in choosing initial cluster centroid for K-Means
clustering
 The steps to initialize the centroids using K-Means++
are:
 1. The first cluster center is chosen uniformly at random from the data
points that we want to cluster. This is similar to what we do in
k-means, but instead of randomly picking all the centroids, we
pick only one centroid here

 2. Next, we compute the distance D(x) of each data point x from the
nearest cluster center that has already been chosen

 3. Then, choose the new cluster center from the data points, with the
probability of picking x being proportional to D(x)²

 4. Repeat steps 2 and 3 until k centers have been chosen
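A small NumPy sketch of this seeding procedure (not from the slides):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]                   # step 1: one center, uniformly at random
    while len(centers) < k:
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)  # D(x)^2 to nearest chosen center
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])                # sample x with prob ∝ D(x)^2
    return np.array(centers)

scikit-learn's KMeans uses this seeding by default (init='k-means++').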


K-Means++ Clustering

(Figure: original data points; 1. a random point is chosen as the initial
cluster centroid; 2. the distance of each point from that centroid is
calculated; 3. the farthest point is chosen as another cluster centroid.)
Courtesy of https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
K-Means++ Clustering

(Figure: 4. compute the distance between each point and its closest
centroid; 5. the data point with the highest squared distance becomes the
next centroid.)
Courtesy of https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
Choosing right number of
clusters
 Elbow method – plot an elbow curve, where the x-axis
represents the number of clusters and the y-axis
is an evaluation metric.
 Choose a metric such as inertia or the Dunn index
and plot its value against the number of clusters

Courtesy of https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
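One way to produce such a curve with scikit-learn (a sketch with made-up toy data; the "elbow" is read off the plot where inertia stops dropping sharply):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))       # toy data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster SSE)")
plt.title("Elbow curve")
plt.show()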
 https://codinginfinite.com/k-means-
clustering-explained-with-numerical-
example/



What Is the Problem of the K-Means
Method?

 The k-means algorithm is sensitive to outliers!
    An object with an extremely large value may
substantially distort the distribution of the data.
 K-Medoids: Instead of taking the mean value of the objects in
a cluster as a reference point, a medoid can be used, which is
the most centrally located object in a cluster.



Partitioning Around Medoids
(PAM)
 Steps to find cluster centers
    1. Choose k data points from the scatter plot as
starting points for cluster centers.
    2. Calculate their distance from all the points in
the scatter plot.
    3. Classify each point into the cluster whose
center it is closest to.
    4. Select a new point in each cluster that
minimizes the sum of distances of all points
in that cluster from itself (swap phase).
    5. Repeat from Step 2 until the centers stop
changing.
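A simple PAM-style sketch in NumPy (not from the slides; Euclidean distances are assumed, and the swap phase here just moves each medoid to the member that minimizes the within-cluster distance sum):

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)    # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)          # step 1: initial medoids
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)                    # steps 2-3: assign to the closest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                costs = D[np.ix_(members, members)].sum(axis=0)  # swap phase: total distance to cluster members
                new_medoids[c] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):                 # stop when the centers stop changing
            break
        medoids = new_medoids
    return medoids, labels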
Hierarchical Clustering
 Produces a set of nested clusters
organized as a hierarchical tree
 Can be visualized as a dendrogram
 A tree-like diagram that records the

sequences of merges or splits

(Figure: six 2-D points and the dendrogram that records the sequence in
which they are merged.)
Hierarchical Clustering
 Given a set of N items to be clustered, and an NxN distance
(or similarity) matrix, the basic process of hierarchical
clustering is as follows:

1. Start by assigning each item to its own cluster, so that if you
have N items, you now have N clusters, each containing just
one item.
2. Find the closest (most similar) pair of clusters and merge
them into a single cluster, so that now you have one less
cluster.
3. Compute distances (similarities) between the new cluster
and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a
single cluster of size N.
Hierarchical Clustering

1. Scan the matrix for the minimum

2. Join items into one node

3. Update matrix and repeat from step 1


Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one

cluster (or k clusters) left

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point

(or there are k clusters)

• Traditional hierarchical algorithms use a similarity
or distance matrix
– Merge or split one cluster at a time
Agglomerative clustering algorithm

 Most popular hierarchical clustering technique


 Basic algorithm
1. Compute the distance matrix between the input data
points
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains
 Key operation is the computation of the
distance between two clusters
 Different definitions of the distance between clusters
lead to different algorithms
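A naive sketch of this loop (not from the slides; it recomputes inter-cluster distances from the full pairwise distance matrix, so it is only meant to make the steps concrete):

import numpy as np

def agglomerative(X, linkage="single"):
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # 1. pairwise distance matrix
    clusters = [[i] for i in range(len(X))]                     # 2. each point starts as its own cluster
    agg = min if linkage == "single" else max                   # single-link or complete-link distance
    merges = []
    while len(clusters) > 1:                                    # 3. repeat ...
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = agg(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))  # 4. merge the two closest clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]                                           # 5. distances are recomputed next pass
    return merges                                                 # 6. ... until a single cluster remains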
Input/ Initial setting
 Start with clusters of individual points and a
distance/proximity matrix
(Figure: points p1, p2, p3, ... each in its own cluster, together with the
initial distance/proximity matrix.)
Intermediate State
 After some merging steps, we have some clusters

(Figure: five clusters C1–C5 and the current distance/proximity matrix.)
Intermediate State
 Merge the two closest clusters (C2 and C5) and update
the distance matrix.
(Figure: clusters C1–C5, with C2 and C5 highlighted as the closest pair to
be merged.)
After Merging
 “How do we update the distance matrix?”

(Figure: the distance matrix after C2 and C5 are merged into C2 ∪ C5; the
distances from C2 ∪ C5 to C1, C3, and C4 must be recomputed.)
Hierarchical Clustering

Distance between two points – easy to compute


Distance between two clusters – harder to compute:

1. Single-Link Method / Nearest Neighbor


2. Complete-Link / Furthest Neighbor
3. Average of all cross-cluster pairs
Typical Alternatives to Calculate the
Distance between Clusters
 Single link: smallest distance between an element in one
cluster and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one
cluster and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)
 Average: average distance between an element in one cluster and
an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = dis(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e.,
dis(Ki, Kj) = dis(Mi, Mj)
 Medoid: a chosen, centrally located object in the cluster
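These alternatives can be computed directly from the cross-cluster distances; a small SciPy/NumPy sketch (not from the slides, Euclidean distance assumed):

import numpy as np
from scipy.spatial.distance import cdist

def cluster_distances(Ki, Kj):
    Ki, Kj = np.asarray(Ki, dtype=float), np.asarray(Kj, dtype=float)
    D = cdist(Ki, Kj)                                    # all cross-cluster pairwise distances
    ci, cj = Ki.mean(axis=0), Kj.mean(axis=0)            # centroids
    mi = Ki[cdist(Ki, Ki).sum(axis=1).argmin()]          # medoid: smallest total distance to own cluster
    mj = Kj[cdist(Kj, Kj).sum(axis=1).argmin()]
    return {
        "single":   D.min(),                             # most similar cross pair
        "complete": D.max(),                             # most dissimilar cross pair
        "average":  D.mean(),                            # average over all cross pairs
        "centroid": np.linalg.norm(ci - cj),
        "medoid":   np.linalg.norm(mi - mj),
    }

print(cluster_distances([[0, 0], [1, 0]], [[4, 0], [6, 0]]))
# {'single': 3.0, 'complete': 6.0, 'average': 4.5, 'centroid': 4.5, 'medoid': 4.0}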
Distance between two clusters
 Single-link distance between clusters Ci
and Cj is the minimum distance between
any object in Ci and any object in Cj

 The distance is defined by the two most
similar objects:

   D_sl(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
Single-link clustering: example

(Figure: nested clusters and the corresponding dendrogram produced by
single-link clustering on six points.)


Strengths of single-link clustering

(Figure: original points and the two clusters found by single link.)

• Can handle non-elliptical shapes


Limitations of single-link clustering

(Figure: original points and the two clusters found by single link.)

• Sensitive to noise and outliers


• It produces long, elongated clusters
Distance between two clusters
 Complete-link distance between clusters
Ci and Cj is the maximum distance
between any object in Ci and any object in
Cj

 The distance is defined by the two most
dissimilar objects:

   D_cl(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }
Complete-link clustering: example

(Figure: nested clusters and the corresponding dendrogram produced by
complete-link clustering on six points.)
Strengths of complete-link clustering

(Figure: original points and the two clusters found by complete link.)

• More balanced clusters (with equal diameter)
• Less susceptible to noise
Limitations of complete-link clustering

(Figure: original points and the two clusters found by complete link.)

• Tends to break large clusters
• All clusters tend to have the same diameter – small clusters are merged


Distance between two clusters
 Group average distance between
clusters Ci and Cj is the average distance
between any object in Ci and any object in
Cj

   D_avg(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ { d(x, y) : x ∈ Ci, y ∈ Cj }
Average-link clustering: example

(Figure: nested clusters and the corresponding dendrogram produced by
average-link clustering on six points.)


Average-link clustering:
discussion
 Compromise between Single and
Complete Link

 Strengths
 Less susceptible to noise and outliers

 Limitations
 Biased towards globular clusters
Distance between two clusters
 Centroid distance between clusters Ci
and Cj is the distance between the centroid
ri of Ci and the centroid rj of Cj

   D_centroids(Ci, Cj) = d(ri, rj)


Distance between two clusters
• Ward’s distance between clusters Ci and Cj is
the difference between the total within
cluster sum of squares for the two clusters
separately, and the within cluster sum of
squares resulting from merging the two
clusters in cluster Cij
Dw Ci , C j    x  ri    x  rj    x  r 
2 2 2
ij
xCi xC j xCij

• ri: centroid of Ci
• rj: centroid of Cj
• rij: centroid of Cij
Ward’s distance for clusters
 Similar to group average and centroid
distance

 Less susceptible to noise and outliers

 Biased towards globular clusters

 Hierarchical analogue of k-means


 Can be used to initialize k-means
Hierarchical Clustering:
Comparison
(Figure: the same six points clustered with MIN (single link), MAX (complete
link), Group Average, and Ward's Method, each producing different nested
clusterings.)
Hierarchical Clustering
In a dendrogram, the length of each tree branch represents
the distance between clusters it joins.

 Different dendrograms may arise when different linkage
methods are used.
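For example, SciPy can build and draw the dendrograms for several linkage methods side by side (a sketch with made-up toy data):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(0).normal(size=(12, 2))        # toy 2-D points

fig, axes = plt.subplots(1, 4, figsize=(14, 3))
for ax, method in zip(axes, ["single", "complete", "average", "ward"]):
    Z = linkage(X, method=method)      # (n-1) x 4 merge table; branch heights = cluster distances
    dendrogram(Z, ax=ax)
    ax.set_title(method)
plt.tight_layout()
plt.show()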
Hierarchical Clustering: Time and
Space requirements
• For a dataset X consisting of n points
• O(n²) space; it requires storing the distance matrix
• O(n³) time in most of the cases
  – There are n steps, and at each step the size-n²
    distance matrix must be updated and searched
  – Complexity can be reduced to O(n² log n) time for
    some approaches by using appropriate data structures
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g.,
Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
(Figure: a set of 2-D points split top-down into progressively smaller
clusters until each point forms its own cluster.)
Divisive hierarchical clustering
• Start with a single cluster composed of all data points
• Split this into components
• Continue recursively
• Monothetic divisive methods split clusters using one
  variable/dimension at a time
• Polythetic divisive methods make splits on the basis of all
  variables together
• Any intercluster distance measure can be used
• Computationally intensive, less widely used than
  agglomerative methods
Manual workout References

https://people.revoledu.com/kardi/tutorial/Clustering/Numerical%20Example.htm

https://online.stat.psu.edu/stat555/node/86/



Strengths of Hierarchical
Clustering
 No assumptions on the number of clusters
    Any desired number of clusters can be
obtained by 'cutting' the dendrogram at
the proper level
 Hierarchical clusterings may correspond to
meaningful taxonomies
    Examples in the biological sciences (e.g.,
phylogeny reconstruction), the web
(e.g., product catalogs), etc.
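Cutting the tree at a chosen level is what scipy.cluster.hierarchy.fcluster does; a short sketch (toy data made up):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(1).normal(size=(20, 2))    # toy data
Z = linkage(X, method="average")                     # build the full merge tree once

labels_3 = fcluster(Z, t=3, criterion="maxclust")    # cut so that at most 3 clusters remain
labels_5 = fcluster(Z, t=5, criterion="maxclust")    # a lower cut of the same tree gives 5 clusters
print(labels_3)
print(labels_5)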
Clustering Algorithms
K-means vs. Hierarchical clustering:
 Computation Time:
– Hierarchical clustering: O( m n2 log(n) )
– K-means clustering: O( k t m n )
 t: number of iterations
 n: number of objects
 m-dimensional vectors
 k: number of clusters
 Memory Requirements:
– Hierarchical clustering: O( mn + n2 )
– K-means clustering: O( mn + kn )
 Other:
 Hierarchical Clustering:
    Need to select a linkage method
    To perform any analysis, it is necessary to partition the dendrogram into k disjoint
clusters, cutting the dendrogram at some point. A limitation is that it is not clear
how to choose this k
 K-means: Need to select K
 In both cases: Need to select distance/similarity measure
More on Hierarchical Clustering
Methods
 Major weakness of agglomerative clustering methods
 do not scale well: time complexity of at least O(n2),

where n is the number of total objects


 can never undo what was done previously

 Integration of hierarchical with distance-based


clustering
 BIRCH (1996): uses CF-tree and incrementally

adjusts the quality of sub-clusters


 CURE (1998): selects well-scattered points from the

cluster and then shrinks them towards the center of


the cluster by a specified fraction
 CHAMELEON (1999): hierarchical clustering using
dynamic modeling
BIRCH (1996)
 Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny
(SIGMOD’96)
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree
(a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to
the order of the data record.
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)

N: number of data points
LS: linear sum of the N points, Σ (i = 1..N) X_i
SS: square sum of the N points, Σ (i = 1..N) X_i²

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
   CF = (5, (16,30), (54,190))
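A minimal sketch (not from the slides) of computing a CF and using its additivity, which is what lets BIRCH build and update the tree incrementally:

import numpy as np

def clustering_feature(points):
    # CF = (N, LS, SS) summarizes a sub-cluster of d-dimensional points
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    # CF additivity: merging two disjoint sub-clusters just adds their CFs component-wise
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

N, LS, SS = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(N, LS, SS)                        # 5 [16. 30.] [ 54. 190.]
print(LS / N)                           # centroid (3.2, 6.0), derived from the CF alone

cf_a = clustering_feature([(3, 4), (2, 6)])
cf_b = clustering_feature([(4, 5), (4, 7), (3, 8)])
print(merge_cf(cf_a, cf_b))             # equals the CF of all five points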

BIRCH (figure)

BIRCH - Example (figure)

BIRCH Algorithm Phases (figures)
CF Tree
(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the
root and non-leaf nodes hold CF entries CF1, CF2, ... with pointers to child
nodes, and leaf nodes hold CF entries linked by prev/next pointers.)
Numeric Example
(Figures: a multi-slide worked numeric example.)
