— Chapter 10 —
Cluster Analysis
Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
What Is Clustering?
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean) vs. connectivity-based (e.g.,
density)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
Quality: What Is Good Clustering?
Types of Clustering
1. Partitioning approach:
Construct various partitions and then evaluate them by
some criterion
2. Hierarchical approach:
Create a hierarchical decomposition of the set of data
(or objects) using some criterion
3. Density-based approach:
Based on connectivity and density functions
4. Grid-based approach:
Based on a multiple-level granularity structure
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
1) Partitioning Methods
K-Means algorithm
K-Medoids algorithm
CLARANS (Clustering Large Applications based upon
RANdomized Search)
2) Hierarchical Methods
3) Density-Based Methods
Algorithm for Density-Based Methods
4) Grid-Based Clustering
Algorithm for Grid-Based Methods
Common Distance measures:
The k-means objective is the within-cluster sum of squared errors:

E = \sum_{i=1}^{k} \sum_{p \in C_i} \big( d(p, c_i) \big)^2

where c_i is the centroid (mean) of cluster C_i and d is the chosen distance.
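As a sketch of how this objective is evaluated, the snippet below computes E for a given clustering. The points and centroids used here are hypothetical examples, not taken from the slides:

```python
def sse(clusters, centroids):
    """Within-cluster sum of squared Euclidean distances:
    E = sum over clusters of sum over points of d(p, c_i)^2."""
    total = 0.0
    for pts, c in zip(clusters, centroids):
        for p in pts:
            total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total

# Hypothetical 2-D example: two clusters and their centroids.
clusters = [[(1.0, 1.0), (2.0, 2.0)], [(5.0, 7.0), (6.0, 8.0)]]
centroids = [(1.5, 1.5), (5.5, 7.5)]
print(sse(clusters, centroids))  # → 2.0
```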
Distance formula (2-D)
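The figure for this slide did not survive extraction; the 2-D distance formula referred to here is presumably the standard Euclidean distance between points p = (x_1, y_1) and q = (x_2, y_2):

```latex
d(p, q) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
```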
K-Means Clustering (using K = 2)
Step 1:
Initialization: we randomly choose the following two centroids
(k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and
m2 = (5.0, 7.0).
Step 2:
Assigning each point to its nearest centroid, we obtain two
clusters containing {1, 2, 3} and {4, 5, 6, 7}.
Their new centroids are recomputed as the means of the
points in each cluster.
Step 3:
Reassigning the points using these new centroids, we obtain
the clusters {1, 2} and {3, 4, 5, 6, 7}; after recomputing the
centroids once more, the assignments no longer change.
Therefore, there is no change in the clusters. The algorithm
halts, and the final result consists of the two clusters
{1, 2} and {3, 4, 5, 6, 7}.
[Plot: the same example clustered with K = 3, showing Steps 1 and 2]
Exercise
Consider the 1-D data set {1, 2, 3, 4, 7, 9} with K = 2.
Identify the clusters and their centroids.
Tip: use …
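A minimal k-means sketch for this 1-D exercise. The slide does not say how to pick the initial centroids, so choosing the two extreme points (1 and 9) is an assumption:

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain k-means on 1-D data: assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean.
    Assumes no cluster becomes empty (true for this data)."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:  # converged: assignments stable
            break
        centroids = new_centroids
    return clusters, centroids

data = [1, 2, 3, 4, 7, 9]
clusters, centroids = kmeans_1d(data, centroids=[1.0, 9.0])  # assumed init
print(clusters, centroids)  # → [[1, 2, 3, 4], [7, 9]] [2.5, 8.0]
```

With this initialization the algorithm converges after two passes: {1, 2, 3, 4} with centroid 2.5 and {7, 9} with centroid 8.0.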
Homework
Use the k-means algorithm and Euclidean distance
to cluster the following 8 examples into 3 clusters:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8),
A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
Plot the resulting clusters in a scatter plot.
What Is the Problem with the K-Means Method?
Determine the Number of Clusters
Empirical method
Number of clusters: k ≈ √(n/2) for a data set of n points
Measuring Clustering Quality
3 kinds of measures: External, internal and relative
External: supervised, employ criteria not inherent to the
dataset
Compare a clustering against prior or expert-specified
knowledge using certain clustering quality measure
Internal: unsupervised, criteria derived from data itself
Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are, e.g., Silhouette coefficient
Relative: directly compare different clusterings, usually those
obtained via different parameter settings for the same algorithm
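As an illustration of the internal criterion mentioned above, here is a minimal sketch of the silhouette coefficient for 1-D data; the cluster contents below are hypothetical:

```python
def silhouette(clusters):
    """Mean silhouette coefficient for 1-D points grouped into clusters.
    For each point: a = mean distance to the rest of its own cluster,
    b = lowest mean distance to any other cluster, s = (b - a) / max(a, b)."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # convention: a singleton cluster scores 0
                continue
            # abs(p - p) = 0, so summing over the whole cluster and dividing
            # by (size - 1) gives the mean distance to the *other* members.
            a = sum(abs(p - q) for q in cluster) / (len(cluster) - 1)
            b = min(sum(abs(p - q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two compact, well-separated clusters score close to 1.
print(round(silhouette([[1, 2, 3], [8, 9, 10]]), 3))  # → 0.807
```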
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Evaluation of Clustering
Summary
Visualization of Clustering
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
BIRCH and CHAMELEON are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
Quality of clustering results can be evaluated in various ways