
Session 9.

Clustering

Learning Methods in Data Mining:
Supervised Learning
Semi-Supervised Learning*
Unsupervised Learning
Reinforcement Learning*
Meta Learning*

*) not covered in this course
Unsupervised Learning:

CLUSTERING

Clustering on Image Segmentation:
Painting from a Photo

CLUSTERING

Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Clustering on Graph
Evaluation
BASIC CONCEPTS

High Dimensional Data
Given a cloud of data points, we want to understand its structure

Clustering Problem
Given a set of points, with a notion of distance
between points, group the points into some
number of clusters, so that
Members of a cluster are close/similar to each other
Members of different clusters are dissimilar
Usually:
Points are in a high-dimensional space
Similarity is defined using a distance measure
Euclidean, Cosine, Jaccard, edit distance, …

Example: Clusters & Outliers

[Figure: a scatter of data points (x) forming several clusters, plus one outlier point far from any cluster]
Clustering is a hard problem!

Why is it hard?
Clustering in two dimensions looks easy
Clustering small amounts of data looks easy
And in most cases, looks are not deceiving

Many applications involve not 2, but 10 or 10,000 dimensions
High-dimensional spaces look different: almost all pairs of points are at about the same distance
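To make that claim concrete, here is a small illustrative sketch (assuming NumPy and SciPy are available) comparing the relative spread of pairwise Euclidean distances for random points in low and high dimensions:

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_spread(dim, n_points=300, seed=0):
    """Relative spread (std/mean) of pairwise Euclidean distances
    for uniformly random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n_points, dim))
    d = pdist(pts)          # condensed vector of all pairwise distances
    return d.std() / d.mean()

for dim in (2, 10, 1000):
    print(f"dim={dim:5d}  std/mean of pairwise distances = {distance_spread(dim):.3f}")
# The relative spread shrinks as the dimension grows: almost all pairs
# of points end up at about the same distance from each other.
```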
Clustering Problem: Galaxies
A catalog of 2 billion “sky objects” represents
objects by their radiation in 7 dimensions
(frequency bands)
Problem: Cluster into similar objects, e.g., galaxies,
nearby stars, quasars, etc.
Sloan Digital Sky Survey

Clustering Problem: Documents
Finding topics:
Represent a document by a vector (x1, x2, …, xk), where xi = 1 iff the i-th word (in some fixed order) appears in the document
It actually doesn’t matter if k is infinite; i.e., we don’t limit the set of words
Documents with similar sets of words may be about the same topic
Similarity Measurement
Vectors: Measure similarity by the Cosine similarity

Sets: Measure similarity by the Jaccard distance

Points: Measure similarity by Euclidean distance

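As a small illustration of these three measures, here is a sketch in plain Python/NumPy (the function names are my own):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (1 = same direction)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two sets (e.g., sets of words in documents)."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def euclidean_distance(p, q):
    """Straight-line distance between two points."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.linalg.norm(p - q))

print(cosine_similarity([1, 0, 1], [1, 1, 0]))          # 0.5
print(jaccard_distance({"data", "mining"}, {"data"}))   # 0.5
print(euclidean_distance([0, 0], [3, 4]))               # 5.0
```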
What is Clustering?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups

Clustering (or cluster analysis, data segmentation, …)
Finding similarities between data according to the characteristics found in the data, and naturally grouping similar data objects into clusters
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

Applications of Cluster Analysis
Data reduction
Summarization: Preprocessing for regression, PCA, classification,
and association analysis
Compression: Image processing: vector quantization
Hypothesis generation and testing
Prediction based on groups
Cluster & find characteristics/patterns for each group
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection: Outliers are often viewed as those “far away” from
any cluster
Basic Steps to Develop a Clustering Task
Feature selection
Select info concerning the task of interest
Minimal information redundancy
Proximity measure
Similarity of two feature vectors
Clustering criterion
Expressed via a cost function or some rules
Clustering algorithms
Choice of algorithms
Validation of the results
Validation test (clustering index: PC, CE, NMI, IFV, PCAES)
Interpretation of the results
Integration with applications
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)

Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)

Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-
based (e.g., density or contiguity)

Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-
dimensional clustering)
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
1. PARTITIONING METHODS

Partitioning Algorithms: Basic Concept
Partitioning method: Partitioning a database D of n objects into
a set of k clusters, such that the sum of squared distances is
minimized (where ci is the centroid or medoid of cluster Ci)
E = \sum_{i=1}^{k} \sum_{p \in C_i} \left( d(p, c_i) \right)^2
Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the
center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87):
Each cluster is represented by one of the objects in the cluster

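A minimal sketch of this criterion (NumPy; the function name is my own) that computes E for a given assignment of points to clusters:

```python
import numpy as np

def sum_of_squared_distances(points, labels, centers):
    """E = sum over clusters i, over points p in C_i, of d(p, c_i)^2."""
    points = np.asarray(points, float)
    centers = np.asarray(centers, float)
    diffs = points - centers[labels]   # each point minus its own cluster center
    return float((diffs ** 2).sum())

# Example: two clusters with centers at (0, 0) and (10, 10)
pts = np.array([[0, 1], [1, 0], [10, 11], [9, 10]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0, 0], [10, 10]])
print(sum_of_squared_distances(pts, labels, centers))   # 4.0
```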
k–means Algorithm

Assumes Euclidean space/distance

Start by picking k, the number of clusters

Initialize clusters by picking one point per cluster


Example: Pick one point at random, then k-1 other points, each as far away as possible from the previous points

Populating Clusters
1) For each point, place it in the cluster whose current centroid is nearest

2) After all points are assigned, update the locations of the centroids of the k clusters

3) Reassign all points to their closest centroid
Sometimes this moves points between clusters

Repeat 2 and 3 until convergence
Convergence: points don’t move between clusters and centroids stabilize
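A minimal from-scratch sketch of this loop (NumPy; the function name and the simple random initialization are my own choices, not a particular library's API):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its points, until stable."""
    points = np.asarray(points, float)
    rng = np.random.default_rng(seed)
    # Initialize with k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for iteration in range(max_iter):
        # Assignment step: nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if iteration > 0 and np.array_equal(new_labels, labels):
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for i in range(k):
            members = points[labels == i]
            if len(members):  # keep the old centroid if a cluster becomes empty
                centroids[i] = members.mean(axis=0)
    return labels, centroids

pts = np.array([[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 8.5], [8.5, 9]])
labels, centers = kmeans(pts, k=2)
print(labels)   # e.g., [0 0 0 1 1 1] (cluster numbering may differ)
print(centers)
```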
Example: Assigning Clusters

[Figure: clusters after round 1; x marks a data point, a filled dot marks a centroid]
Example: Assigning Clusters

[Figure: clusters after round 2; x marks a data point, a filled dot marks a centroid]
Example: Assigning Clusters

[Figure: clusters at the end; x marks a data point, a filled dot marks a centroid]
Selecting k (number of clusters)

Elbow Method
Try different values of k, looking at the change in the average distance to centroid as k increases
The average falls rapidly until the right k is reached, then changes little

[Figure: average distance to centroid plotted against k; the curve drops quickly and then flattens at the best value of k]

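As an illustration, the elbow idea can be sketched with scikit-learn's KMeans (assuming scikit-learn is installed; inertia_ is its total within-cluster sum of squared distances):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data with three well-separated blobs
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                  for c in [(0, 0), (5, 5), (0, 5)]])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    avg_sq_dist = km.inertia_ / len(data)   # average squared distance to centroid
    print(f"k={k}  avg squared distance to centroid = {avg_sq_dist:.3f}")
# The value drops sharply up to k = 3 and changes little afterwards,
# so the "elbow" suggests k = 3 for this data.
```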
Example: Selecting k

Too few: many long distances to centroid.

Example: Selecting k

Just right: distances rather short.

Example: Selecting k

Too many: little improvement in average distance.

Steps of the k-Means Algorithm
1. Choose the desired number of clusters k
2. Initialize the k cluster centers (centroids) at random
3. Assign every data object to the nearest cluster. Closeness between two objects is determined by their distance; the distance used in k-Means is the Euclidean distance (d):

d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

where x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) are two records with n attributes (columns)
4. Recompute the cluster centers using the current cluster memberships. The center of a cluster is the mean of all data objects in that cluster
5. Reassign every object using the new cluster centers. If the cluster centers no longer change, the clustering process is finished. Otherwise, go back to step 3 until the cluster centers no longer change (are stable) or there is no significant decrease in the SSE (Sum of Squared Errors)
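In practice these steps are usually run through a library; a hedged sketch using scikit-learn's KMeans (assuming scikit-learn is installed; the toy data is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Steps 1-2: choose k and let the library initialize the centroids
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=42)

# Steps 3-5: assignment and centroid updates are iterated internally
labels = km.fit_predict(X)

print("cluster labels :", labels)
print("cluster centers:", km.cluster_centers_)
print("SSE (inertia)  :", km.inertia_)
```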
An Example of K-Means Clustering

K = 2

[Figure: the initial data set is arbitrarily partitioned into k groups; the cluster centroids are then updated and objects reassigned, looping as needed until nothing changes]

 Partition objects into k nonempty subsets
 Repeat
 Compute the centroid (i.e., mean point) for each partition
 Assign each object to the cluster of its nearest centroid
 Until no change
Strength of K-Means
Strength:
1. Efficient: O(n), where n is the number of data points
2. Simple implementation
3. Guarantees convergence
4. Easily adapts to new data

(Google Developers)

Weaknesses of K-Means
Weaknesses:
1. Not good at clusters with different densities and sizes
2. k must be chosen manually
3. Sensitive to noisy data and outliers
4. Dependent on initial values
5. Not suitable for discovering clusters with non-convex shapes
6. Scales poorly with the number of dimensions

Weaknesses of K-Means
Clusters with Different Densities and Sizes

[Figure: natural clusters vs. the clusters produced by K-Means]

Weaknesses of K-Means
Clusters with Different Densities and Sizes

Solution: Use Generalized K-Means


Weaknesses of K-Means
Can’t handle non-convex clusters

[Figure: input data vs. K-Means output]

Weaknesses of K-Means
Can’t handle non-convex clusters

Solution: Combine with other methods, such as agglomerative clustering


Weaknesses of K-Means

Can’t handle non-convex clusters

Weaknesses of K-Means
The k-means algorithm is sensitive to outliers!
K-Medoids:
Instead of taking the mean value of the object in a cluster as a
reference point, medoids can be used, which is the most centrally
located object in a cluster

PAM: A Typical K-Medoids Algorithm
K = 2

[Figure: PAM on a small 2-D data set; the two configurations shown have total costs of 20 and 26]

Arbitrarily choose k objects as the initial medoids
Assign each remaining object to the nearest medoid
Randomly select a non-medoid object, O_random
Do loop, until no change:
Compute the total cost of swapping a medoid with O_random
Swap them if the quality is improved
The K-Medoid Clustering Method
K-Medoids Clustering: Find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
Starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
PAM works effectively for small data sets, but does not scale well for large
data sets (due to the computational complexity)
Efficiency improvement on PAM
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): Randomized re-sampling

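A simplified sketch of the PAM swap idea (NumPy; the function names are my own, and this naive version recomputes the full cost of every candidate swap, so it is only meant for small data sets):

```python
import numpy as np
from itertools import product

def total_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=-1)
    return d.min(axis=1).sum()

def pam(points, k, seed=0):
    points = np.asarray(points, float)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(points), size=k, replace=False))
    best = total_cost(points, medoids)
    improved = True
    while improved:                       # keep swapping while it helps
        improved = False
        for m_pos, o in product(range(k), range(len(points))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_pos] = o          # try one medoid / non-medoid swap
            cost = total_cost(points, candidate)
            if cost < best:
                medoids, best, improved = candidate, cost, True
    labels = np.linalg.norm(
        points[:, None, :] - points[medoids][None, :, :], axis=-1).argmin(axis=1)
    return medoids, labels, best

pts = np.array([[1, 1], [1.2, 0.8], [1, 2], [8, 8], [8.5, 8], [25, 25]])  # last point is an outlier
medoids, labels, cost = pam(pts, k=2)
print("medoid indices:", medoids, " labels:", labels, " total cost:", cost)
```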
Variations of the K-Means Method
Most variants of k-means differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means

Handling categorical data: k-modes
Replacing the means of clusters with modes
Using new dissimilarity measures to deal with categorical
objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype
method
2. HIERARCHICAL METHODS

Hierarchical Clustering
Use distance matrix as clustering criteria
This method does not require the number of clusters k as an
input, but needs a termination condition

[Figure: agglomerative clustering (AGNES) runs from step 0 to step 4, merging a, b, c, d, e into ab, de, cde, and finally abcde; divisive clustering (DIANA) runs the same steps in reverse, from step 4 back to step 0]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
Dendrogram: Shows How Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster

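As an illustration, a small sketch with SciPy (assuming it is available) that builds a single-link dendrogram and cuts it to obtain flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Build the merge tree (dendrogram) using single-link distances
Z = linkage(pts, method="single")

# "Cut" the dendrogram so that at most 3 clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # e.g., [1 1 2 2 3]

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib
```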
DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)


Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own


Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min { dist(tip, tjq) }

Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max { dist(tip, tjq) }

Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg { dist(tip, tjq) }

Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)

Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
Medoid: a chosen, centrally located object in the cluster
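A minimal sketch of the first three inter-cluster distances (NumPy; the function and variable names are my own):

```python
import numpy as np

def pairwise_distances(A, B):
    """Matrix of Euclidean distances between every point of A and every point of B."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):    return pairwise_distances(A, B).min()
def complete_link(A, B):  return pairwise_distances(A, B).max()
def average_link(A, B):   return pairwise_distances(A, B).mean()

K1 = [[0, 0], [0, 1]]
K2 = [[3, 0], [4, 0]]
print(single_link(K1, K2))    # 3.0   (closest pair)
print(complete_link(K1, K2))  # ~4.12 (farthest pair)
print(average_link(K1, K2))   # mean over all 4 pairs
```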
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Centroid: the “middle” of a cluster
C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

Radius: square root of the average distance from any point of the cluster to its centroid
R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}

Diameter: square root of the average mean squared distance between all pairs of points in the cluster
D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}
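A minimal sketch computing these three quantities for a small numeric cluster (NumPy; the function names are my own):

```python
import numpy as np

def centroid(cluster):
    """Mean point of the cluster."""
    return np.asarray(cluster, float).mean(axis=0)

def radius(cluster):
    """Square root of the average squared distance from the points to the centroid."""
    pts = np.asarray(cluster, float)
    c = pts.mean(axis=0)
    return np.sqrt(((pts - c) ** 2).sum(axis=1).mean())

def diameter(cluster):
    """Square root of the average squared distance between all pairs of distinct points."""
    pts = np.asarray(cluster, float)
    n = len(pts)
    sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(sq.sum() / (n * (n - 1)))

cluster = [[0, 0], [2, 0], [0, 2], [2, 2]]
print(centroid(cluster))   # [1. 1.]
print(radius(cluster))     # ~1.414 (sqrt(2))
print(diameter(cluster))   # ~2.31
```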
