
Data Mining: Concepts and Techniques
(3rd ed.)

— Chapter 10 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &
Simon Fraser University

©2013 Han, Kamber & Pei. All rights reserved.
Chapter 10. Cluster Analysis: Basic Concepts and Methods

 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
What is Cluster Analysis?

 Cluster: a collection of data objects
  similar (or related) to one another within the same group
  dissimilar (or unrelated) to the objects in other groups
 Cluster analysis
  Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
  As a stand-alone tool to gain insight into the data distribution
  As a preprocessing step for other algorithms
What is Clustering?

 Clustering is the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.
Considerations for Cluster Analysis

 Partitioning criteria
  Single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
 Separation of clusters
  Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
 Similarity measure
  Distance-based (e.g., Euclidean) vs. connectivity-based (e.g., density)
 Clustering space
  Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
Quality: What Is Good Clustering?

 A good clustering method will produce high-quality clusters with
  high intra-class similarity: cohesive within clusters
  low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
  the similarity measure used by the method,
  its implementation, and
  its ability to discover some or all of the hidden patterns
Types of Clustering

1. Partitioning approach:
  Construct various partitions and then evaluate them by some criterion
2. Hierarchical approach:
  Create a hierarchical decomposition of the set of data (or objects) using some criterion
3. Density-based approach:
  Based on connectivity and density functions
4. Grid-based approach:
  Based on a multiple-level granularity structure
Chapter 10. Cluster Analysis: Basic Concepts and Methods

 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Evaluation of Clustering
 Summary
1) Partitioning Methods

Algorithms for Partitioning Methods

 K-Means algorithm
 K-Medoids algorithm
 CLARANS (Clustering Large Applications based upon RANdomized Search)
2) Hierarchical Methods

Algorithms for Hierarchical Methods

 AGNES (AGglomerative NESting)
 DIANA (DIvisive ANAlysis)
 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
 Chameleon (Clustering Using Dynamic Modeling)
3) Density-Based Methods

Algorithms for Density-Based Methods

 DBSCAN (Density-Based Clustering Based on Connected Regions with High Density)
 OPTICS (Ordering Points To Identify the Clustering Structure)
 DENCLUE (DENsity-based CLUstEring)
4) Grid-Based Methods

Algorithms for Grid-Based Methods

 STING (STatistical INformation Grid)
 CLIQUE (an Apriori-like subspace clustering method)
 WaveCluster
Common Distance Measures

 The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters. Common measures include the following (see the code sketch below):

1. The Euclidean distance (also called the 2-norm distance), given by:

   d(p, q) = √( Σ_{i=1..n} (p_i − q_i)^2 )

2. The Manhattan distance (also called the taxicab or 1-norm distance), given by:

   d(p, q) = Σ_{i=1..n} |p_i − q_i|
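As an illustration (not part of the original slides), both measures can be computed in a few lines of NumPy; the function names euclidean and manhattan are ours:

import numpy as np

def euclidean(p, q):
    # 2-norm: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(p, dtype=float) - np.asarray(q, dtype=float)) ** 2))

def manhattan(p, q):
    # 1-norm: sum of absolute coordinate differences
    return np.sum(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))

print(euclidean((1.0, 1.0), (5.0, 7.0)))  # ~7.21
print(manhattan((1.0, 1.0), (5.0, 7.0)))  # 10.0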
Partitioning Algorithms: Basic Concept

 Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i):

   E = Σ_{i=1..k} Σ_{p ∈ C_i} (d(p, c_i))^2

 Global optimum: exhaustively enumerate all partitions
 Heuristic methods: the k-means and k-medoids algorithms
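A minimal sketch (our illustration, assuming NumPy; the helper name sse is hypothetical) of evaluating the criterion E for a given assignment of points to clusters:

import numpy as np

def sse(points, labels, centroids):
    # Sum of squared Euclidean distances from each point to the
    # centroid of its assigned cluster.
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    diffs = points - centroids[labels]  # per-point offset from own centroid
    return float(np.sum(diffs ** 2))

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0)]
labels = [0, 0, 1]
print(sse(points, labels, [(1.25, 1.5), (3.0, 4.0)]))  # 0.625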
Distance Formula (2-D)

 In two dimensions, the Euclidean distance between points (x1, y1) and (x2, y2) reduces to:

   d = √( (x2 − x1)^2 + (y2 − y1)^2 )
K-Means Clustering

 The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
 It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data.
 It assumes that the object attributes form a vector space.
 It partitions (or clusters) the N data points into K disjoint subsets S_j so as to minimize the sum-of-squares criterion:

   J = Σ_{j=1..K} Σ_{x_n ∈ S_j} ‖x_n − u_j‖^2

 where x_n is a vector representing the nth data point and u_j is the geometric centroid of the data points in S_j.
 Simply speaking, k-means clustering is an algorithm to classify or group objects, based on their attributes/features, into K groups.
 K is a positive integer.
 The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids (see the sketch below).
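The criterion above can be minimized by the familiar two-step iteration. The following is a minimal from-scratch sketch in NumPy, written for this text as an illustration under the slide's assumptions (Euclidean distance, vector-space attributes), not a reference implementation:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its old centroid (one common convention)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # no change: converged
        centroids = new_centroids
    return labels, centroids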
An Example of K-Means Clustering (K = 2)

 Partition objects into k nonempty subsets
 Repeat
  Compute the centroid (i.e., mean point) of each partition
  Assign each object to the cluster of its nearest centroid
 Until no change

(Figure: starting from the initial data set, the objects are arbitrarily partitioned into k groups; the loop then alternates between updating the cluster centroids and reassigning objects until assignments no longer change.)
How Does the K-Means Clustering Algorithm Work?

 Step 1: Decide on the value of k, the number of clusters.
 Step 2: Form an initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
  1. Take the first k training samples as single-element clusters.
  2. Assign each of the remaining (N − k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
 Step 3: Take each sample in sequence and compute its distance from the centroid of each cluster. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster and update the centroids of both the cluster gaining the sample and the cluster losing it.
 Step 4: Repeat Step 3 until convergence, that is, until a full pass through the training samples causes no new assignments (a brief library-based example follows).
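In practice a library routine is usually preferred over a hand-rolled loop. A brief sketch with scikit-learn (assuming it is installed), run on the seven points of the worked example that follows:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # final centroids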
Implementation of the K-Means Algorithm (using K = 2)

 The data set consists of seven objects: 1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0), 5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5).

Step 1:
 Initialization: we randomly choose two centroids (k = 2) for the two clusters; in this case m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Step 2:
 Assigning each object to the cluster with the nearest centroid, we obtain two clusters: {1, 2, 3} and {4, 5, 6, 7}.
 Their new centroids are m1 = (1.83, 2.33) and m2 = (4.12, 5.38).

Step 3:
 Using these centroids, we compute the Euclidean distance of each object from both centroids.
 The new clusters are {1, 2} and {3, 4, 5, 6, 7}.
 The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).

Step 4:
 The clusters obtained are again {1, 2} and {3, 4, 5, 6, 7}; there is no change in the clusters.
 Thus the algorithm halts, and the final result consists of the two clusters {1, 2} and {3, 4, 5, 6, 7} (checked in the code sketch below).
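The iterations above can be checked mechanically. A small NumPy sketch (our illustration) that reproduces the assignments and centroids pass by pass:

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
m = np.array([[1.0, 1.0], [5.0, 7.0]])  # initial centroids m1, m2
for step in range(3):
    # Assign each object to the nearest centroid, then recompute centroids.
    # Note: object 3 is equidistant from the two initial centroids;
    # argmin breaks the tie toward m1, matching the {1, 2, 3} grouping.
    labels = np.linalg.norm(X[:, None] - m[None, :], axis=2).argmin(axis=1)
    m = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print("pass", step + 1, labels, np.round(m, 2))
# passes 2 and 3 give the same clusters {1, 2} and {3, 4, 5, 6, 7}: converged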
(Plots: scatter plots of the data and the cluster assignments after Steps 1 and 2, shown the same way for a run with K = 3.)
Exercise

 Consider the 1-D data set {1, 2, 3, 4, 7, 9} with K = 2.
 Identify the clusters and their centroids.
 Tip: in one dimension the Euclidean distance reduces to |p − q|.
Homework

 Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
 A1 = (2, 10), A2 = (2, 5), A3 = (8, 4), A4 = (5, 8), A5 = (7, 5), A6 = (6, 4), A7 = (1, 2), A8 = (4, 9).
 Plot the resulting clusters in a scatter plot.
What Is the Problem with the K-Means Method?

 The k-means algorithm is sensitive to outliers!
  An object with an extremely large value may substantially distort the distribution of the data.
 K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster.
Determine the Number of Clusters

 Empirical method
  Number of clusters: k ≈ √(n/2) for a data set of n points
  e.g., n = 200 gives k ≈ 10
  How many clusters for n = 900?
Measuring Clustering Quality

 Three kinds of measures: external, internal, and relative
 External: supervised; employs criteria not inherent to the data set
  Compare a clustering against prior or expert-specified knowledge using a clustering quality measure
 Internal: unsupervised; criteria derived from the data itself
  Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are, e.g., the silhouette coefficient (see the sketch below)
 Relative: directly compare different clusterings, usually those obtained via different parameter settings of the same algorithm
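As a sketch of an internal measure (assuming scikit-learn is installed), the mean silhouette coefficient of a k-means clustering can be computed as:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # closer to 1 = better-separated clusters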
Chapter 10. Cluster Analysis: Basic Concepts and Methods

 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Evaluation of Clustering
 Summary
Visualization of Clustering

(Figures: visual examples of clustering results.)
Summary

 Cluster analysis groups objects based on their similarity and has wide applications
 Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
 K-means and k-medoids are popular partitioning-based clustering algorithms
 BIRCH and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
 DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
 STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
 The quality of clustering results can be evaluated in various ways