Session 9 Material: Clustering A
Clustering
Learning Methods in Data Mining
Meta Learning*
Reinforcement Learning*
*) not covered in this course
Unsupervised Learning:
CLUSTERING
Clustering for Image Segmentation: Painting from a Photo
CLUSTERING
Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Clustering on Graph
Evaluation
BASIC CONCEPTS
High Dimensional Data
Given a cloud of data points, we want to understand its structure.
Clustering Problem
Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that:
Members of a cluster are close/similar to each other
Members of different clusters are dissimilar
Usually:
Points are in a high-dimensional space
Similarity is defined using a distance measure: Euclidean, Cosine, Jaccard, edit distance, … (a short sketch of several of these follows)
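To make these measures concrete, here is a minimal sketch (Python; NumPy is assumed to be available, and the function names are illustrative, not from any particular library):

import numpy as np

def euclidean(x, y):
    # Straight-line distance between two numeric vectors
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def jaccard(a, b):
    # Set similarity: |intersection| / |union|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def edit_distance(s, t):
    # Levenshtein distance via dynamic programming
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(euclidean([0, 0], [3, 4]))           # 5.0
print(jaccard("ab", "bc"))                 # 1/3: sets {a,b} and {b,c}
print(edit_distance("kitten", "sitting"))  # 3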
Example: Clusters & Outliers
[Figure: a 2-D scatter of "x" points forming a few clusters, with an outlier point lying far from any cluster.]
Clustering is a hard problem!
Why is it hard?
Clustering in two dimensions looks easy
Clustering small amounts of data looks easy
And in most cases, looks are not deceiving
Clustering Problem: Documents
Finding topics:
Represent a document by a vector (x1, x2, …, xk), where xi = 1 iff the i-th word (in some order) appears in the document
It actually doesn’t matter if k is infinite; i.e., we don’t limit the set of words
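A minimal sketch of this representation (Python; the sample vocabulary and names are illustrative):

def binary_vector(document, vocabulary):
    # x_i = 1 iff the i-th vocabulary word appears in the document
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocab = ["cluster", "distance", "topic", "graph"]
print(binary_vector("Distance between cluster members", vocab))  # [1, 1, 0, 0]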
Similarity Measurement
Vectors: measure similarity by the cosine similarity of the document vectors
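A minimal cosine-similarity sketch over the binary document vectors above (plain Python; names illustrative):

import math

def cosine(x, y):
    # cos(theta) = (x . y) / (|x| |y|); close to 1 means similar direction
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(cosine([1, 1, 0, 0], [1, 0, 0, 1]))  # 0.5: the documents share one word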
What is Clustering?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Applications of Cluster Analysis
Data reduction
Summarization: preprocessing for regression, PCA, classification, and association analysis
Compression: image processing, e.g., vector quantization
Hypothesis generation and testing
Prediction based on groups: cluster & find characteristics/patterns for each group
Finding K-nearest neighbors: localizing search to one or a small number of clusters
Outlier detection: outliers are often viewed as those “far away” from any cluster
Basic Steps to Develop a Clustering Task
Feature selection: select information concerning the task of interest, with minimal information redundancy
Proximity measure: similarity of two feature vectors
Clustering criterion: expressed via a cost function or some rules
Clustering algorithms: choice of algorithms
Validation of the results: validation test (clustering indices: PC, CE, NMI, IFV, PCAES)
Interpretation of the results: integration with applications
Considerations for Cluster Analysis
Partitioning criteria: single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Similarity measure: distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space: full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
Requirements and Challenges
Scalability: clustering all the data instead of only samples
Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
Constraint-based clustering: the user may give inputs on constraints; use domain knowledge to determine input parameters
Interpretability and usability
Others:
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
1. PARTITIONING METHODS
Partitioning Algorithms: Basic Concept
Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i):
$E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimum: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster
k-means Algorithm
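The algorithm appears here only as a figure, so the following is a minimal Lloyd-style k-means sketch (Python with NumPy assumed; function and parameter names are illustrative):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # X: (n, d) array of points; k: number of clusters
    rng = np.random.default_rng(seed)
    # 1) initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2) assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) recompute each centroid as the mean of its members
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])
labels, cents = kmeans(X, k=2)  # two tight clusters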
Populating Clusters
1) For each point, place it in the cluster whose current centroid is nearest
[Figure: data points (x) and centroids; clusters after round 1.]
Example: Assigning Clusters
[Figures: the same data points and centroids after re-assignment; clusters after round 2, and clusters at the end.]
Selecting k (number of clusters)
Elbow Method
Try different k, looking at the change in the average distance to centroid as k increases. The average falls rapidly until the right k, then changes little.
[Plot: average distance to centroid vs. k; the elbow of the curve marks the best value of k.]
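A sketch of this heuristic, reusing the kmeans function sketched earlier (an assumption; any k-means implementation works):

import numpy as np

def average_distance(X, labels, centroids):
    # Mean distance from each point to its assigned centroid
    return float(np.mean(np.linalg.norm(X - centroids[labels], axis=1)))

X = np.random.default_rng(0).normal(size=(300, 2))
for k in range(1, 8):
    labels, cents = kmeans(X, k)
    print(k, round(average_distance(X, labels, cents), 3))  # look for the elbow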
Example: Selecting k
Too few; many long distances to centroid.
Just right; distances rather short.
Too many; little improvement in average distance.
Steps of the k-Means Algorithm
1. Choose the desired number of clusters k.
2. Initialize the k cluster centers (centroids) randomly.
3. Assign each data object to the nearest cluster. The closeness of two objects is determined by their distance; the distance used in k-means is the Euclidean distance (d):
$d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
where x = (x1, x2, …, xn) and y = (y1, y2, …, yn) are two records with n attributes (columns).
4. Recompute the cluster centers from the current cluster memberships. A cluster center is the mean of all data objects in that cluster.
5. Reassign every object using the new cluster centers. If the cluster centers no longer change, the clustering process is finished. Otherwise, return to step 3 until the cluster centers no longer change (are stable) or there is no significant decrease in the SSE (Sum of Squared Errors), as sketched below.
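Step 5's SSE stopping test can be computed directly; a minimal sketch (NumPy assumed; names illustrative):

import numpy as np

def sse(X, labels, centroids):
    # Sum of squared Euclidean distances from each point to its cluster center
    return float(np.sum((X - centroids[labels]) ** 2))

Iteration stops once successive SSE values differ by less than a chosen tolerance.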
An Example of K-Means Clustering
[Figure: k-means iterations on a 2-D data set with K = 2.]
Strengths of K-Means
1. Efficient: O(n), where n is the number of data points
2. Simple implementation
3. Guarantees convergence
4. Easily adapts to new data
(Google Developers)
Weaknesses of K-Means
1. Not good at clusters with different densities and sizes
2. k must be chosen manually
3. Sensitive to noisy data and outliers
4. Dependent on initial values
5. Not suitable for discovering clusters with non-convex shapes
6. Scales poorly with the number of dimensions
[Figures: clusters with different densities and sizes, and non-convex clusters, which k-means handles poorly.]
Weaknesses of K-Means
The k-means algorithm is sensitive to outliers!
K-Medoids:
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster
PAM: A Typical K-Medoids Algorithm
With K = 2: arbitrarily choose k objects as initial medoids, then assign each remaining object to its nearest medoid (total cost = 20).
Do loop: randomly select a non-medoid object O_random, compute the total cost of swapping a medoid with O_random (here, total cost = 26), and swap if the quality is improved.
[Figure: scatter plots illustrating the initialization, assignment, and swap steps.]
The K-Medoids Clustering Method
K-Medoids clustering: find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987): starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well to large data sets (due to its computational complexity)
Efficiency improvements on PAM:
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): randomized re-sampling
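A naive PAM-style sketch (Python with NumPy assumed; illustrative and unoptimized, which also makes its poor scaling visible):

import numpy as np

def pam(X, k, seed=0):
    # Try every (medoid, non-medoid) swap; keep a swap only if it lowers cost
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(meds):
        # Total distance of every object to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                c = cost(candidate)
                if c < best:  # quality improved: accept the swap
                    medoids, best = candidate, c
                    improved = True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels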
Variations of the K-Means Method
Most variants of k-means differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: three scatter plots showing points being progressively merged into clusters.]
Dendrogram: Shows How Clusters are Merged
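A minimal agglomerative sketch using SciPy (assuming scipy and matplotlib are installed; the data is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.0], [8.0, 8.0], [8.0, 8.5], [5.0, 5.0]])
Z = linkage(X, method="average")  # bottom-up merge tree over the distances
dendrogram(Z)                     # shows the order and height of each merge
plt.show()
# Termination condition instead of a fixed k: cut the tree at distance 3.0
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)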
DIANA (Divisive Analysis)
[Figure: three scatter plots showing all objects starting in one cluster and being split into smaller clusters step by step.]
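SciPy has no DIANA implementation, so here is a hedged sketch of the splinter-group split that DIANA applies top-down (Python with NumPy assumed; simplified from the classical description):

import numpy as np

def diana_split(D, cluster):
    # D: full pairwise-distance matrix; cluster: list of object indices.
    # Returns one split of `cluster` into (splinter group, remainder).
    rest = list(cluster)
    # Seed the splinter group with the object having the largest
    # average dissimilarity to the other members
    avg = [D[i, [j for j in rest if j != i]].mean() for i in rest]
    splinter = [rest.pop(int(np.argmax(avg)))]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            others = [j for j in rest if j != i]
            # Move i if it is, on average, closer to the splinter group
            if D[i, splinter].mean() < D[i, others].mean():
                rest.remove(i)
                splinter.append(i)
                moved = True
    return splinter, rest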
Distance between Clusters
Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
Medoid: a chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
Centroid: the “middle” of a cluster: $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
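The slide's radius and diameter formulas did not survive extraction; the sketch below uses the standard textbook definitions (radius: root of the average squared distance of members to the centroid; diameter: root of the average squared pairwise distance), which is an assumption about the missing content (NumPy assumed):

import numpy as np

def centroid(pts):
    # C_m: component-wise mean of the N member points
    return pts.mean(axis=0)

def radius(pts):
    # sqrt of the average squared distance from members to the centroid
    return float(np.sqrt(np.mean(np.sum((pts - centroid(pts)) ** 2, axis=1))))

def diameter(pts):
    # sqrt of the average squared pairwise distance (over i != j)
    n = len(pts)
    d2 = np.sum((pts[:, None] - pts[None, :]) ** 2, axis=2)
    return float(np.sqrt(d2.sum() / (n * (n - 1))))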