
Session 9.

Clustering

Learning Methods in Data Mining:
Supervised Learning
Semi-Supervised Learning*
Unsupervised Learning
Reinforcement Learning*
Meta Learning*

*) not covered in this course
Unsupervised Learning:

CLUSTERING

Clustering on Image Segmentation:
Painting from a Photo

CLUSTERING

Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Clustering on Graph
Evaluation
BASIC CONCEPTS

High Dimensional Data
Given a cloud of data points, we want to understand its structure

Clustering Problem
Given a set of points, with a notion of distance
between points, group the points into some
number of clusters, so that
Members of a cluster are close/similar to each other
Members of different clusters are dissimilar
Usually:
Points are in a high-dimensional space
Similarity is defined using a distance measure
Euclidean, Cosine, Jaccard, edit distance, …

Example: Clusters & Outliers

[Figure: a scatter of data points (x) forming several clusters, plus one outlier point far from any cluster]
Clustering is a hard problem!

Why is it hard?
Clustering in two dimensions looks easy
Clustering small amounts of data looks easy
And in most cases, looks are not deceiving

Many applications involve not 2, but 10 or 10,000 dimensions
High-dimensional spaces look different: almost all pairs of points are at about the same distance
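To make that claim concrete, here is a small illustrative sketch (assuming NumPy and SciPy are available) comparing the relative spread of pairwise Euclidean distances for random points in low and high dimensions:

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_spread(dim, n_points=300, seed=0):
    """Relative spread (std/mean) of pairwise Euclidean distances
    for uniformly random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n_points, dim))
    d = pdist(pts)          # condensed vector of all pairwise distances
    return d.std() / d.mean()

for dim in (2, 10, 1000):
    print(f"dim={dim:5d}  std/mean of pairwise distances = {distance_spread(dim):.3f}")
# The relative spread shrinks as the dimension grows: almost all pairs
# of points end up at about the same distance from each other.
```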
Clustering Problem: Galaxies
A catalog of 2 billion “sky objects” represents
objects by their radiation in 7 dimensions
(frequency bands)
Problem: Cluster into similar objects, e.g., galaxies,
nearby stars, quasars, etc.
Sloan Digital Sky Survey

Clustering Problem: Documents
Finding topics:
Represent a document by a vector (x1, x2, …, xk), where xi = 1 iff the i-th word (in some fixed order) appears in the document
It actually doesn’t matter if k is infinite; i.e., we don’t limit the set of words
Documents with similar sets of words may be about the same topic
Similarity Measurement
Vectors: Measure similarity by the Cosine similarity

Sets: Measure similarity by the Jaccard distance

Points: Measure similarity by Euclidean distance

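As a small illustration of these three measures, here is a sketch in plain Python/NumPy (the function names are my own):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (1 = same direction)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two sets (e.g., sets of words in documents)."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def euclidean_distance(p, q):
    """Straight-line distance between two points."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.linalg.norm(p - q))

print(cosine_similarity([1, 0, 1], [1, 1, 0]))          # 0.5
print(jaccard_distance({"data", "mining"}, {"data"}))   # 0.5
print(euclidean_distance([0, 0], [3, 4]))               # 5.0
```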
What is Clustering?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups

Clustering (or cluster analysis, data segmentation, …)
Finding similarities between data according to the characteristics found in the data, and naturally grouping similar data objects into clusters
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

Applications of Cluster Analysis
Data reduction
Summarization: Preprocessing for regression, PCA, classification,
and association analysis
Compression: Image processing: vector quantization
Hypothesis generation and testing
Prediction based on groups
Cluster & find characteristics/patterns for each group
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection: Outliers are often viewed as those “far away” from
any cluster
Basic Steps to Develop a Clustering Task
Feature selection
Select info concerning the task of interest
Minimal information redundancy
Proximity measure
Similarity of two feature vectors
Clustering criterion
Expressed via a cost function or some rules
Clustering algorithms
Choice of algorithms
Validation of the results
Validation test (clustering index: PC, CE, NMI, IFV, PCAES)
Interpretation of the results
Integration with applications
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)

Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)

Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-
based (e.g., density or contiguity)

Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-
dimensional clustering)
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
1. PARTITIONING METHODS

Partitioning Algorithms: Basic Concept
Partitioning method: Partitioning a database D of n objects into
a set of k clusters, such that the sum of squared distances is
minimized (where ci is the centroid or medoid of cluster Ci)
E = \sum_{i=1}^{k} \sum_{p \in C_i} \left( d(p, c_i) \right)^2
Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the
center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87):
Each cluster is represented by one of the objects in the cluster

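A minimal sketch of this criterion (NumPy; the function name is my own) that computes E for a given assignment of points to clusters:

```python
import numpy as np

def sum_of_squared_distances(points, labels, centers):
    """E = sum over clusters i, over points p in C_i, of d(p, c_i)^2."""
    points = np.asarray(points, float)
    centers = np.asarray(centers, float)
    diffs = points - centers[labels]   # each point minus its own cluster center
    return float((diffs ** 2).sum())

# Example: two clusters with centers at (0, 0) and (10, 10)
pts = np.array([[0, 1], [1, 0], [10, 11], [9, 10]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0, 0], [10, 10]])
print(sum_of_squared_distances(pts, labels, centers))   # 4.0
```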
k–means Algorithm

Assumes Euclidean space/distance

Start by picking k, the number of clusters

Initialize clusters by picking one point per cluster


Example: Pick one point at random, then k-1 other points, each as far away as possible from the previous points

Populating Clusters
1) For each point, place it in the cluster whose current centroid is nearest

2) After all points are assigned, update the locations of the centroids of the k clusters

3) Reassign all points to their closest centroid
Sometimes this moves points between clusters

Repeat 2 and 3 until convergence
Convergence: points don’t move between clusters and centroids stabilize
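A minimal from-scratch sketch of this loop (NumPy; the function name and the simple random initialization are my own choices, not a particular library's API):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its points, until stable."""
    points = np.asarray(points, float)
    rng = np.random.default_rng(seed)
    # Initialize with k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for iteration in range(max_iter):
        # Assignment step: nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if iteration > 0 and np.array_equal(new_labels, labels):
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for i in range(k):
            members = points[labels == i]
            if len(members):  # keep the old centroid if a cluster becomes empty
                centroids[i] = members.mean(axis=0)
    return labels, centroids

pts = np.array([[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 8.5], [8.5, 9]])
labels, centers = kmeans(pts, k=2)
print(labels)   # e.g., [0 0 0 1 1 1] (cluster numbering may differ)
print(centers)
```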
Example: Assigning Clusters

[Figure: clusters after round 1; x marks a data point, a filled dot marks a centroid]
Example: Assigning Clusters

[Figure: clusters after round 2; x marks a data point, a filled dot marks a centroid]
Example: Assigning Clusters

[Figure: clusters at the end; x marks a data point, a filled dot marks a centroid]
Selecting k (number of clusters)

Elbow Method
Try different values of k, looking at the change in the average distance to centroid as k increases
The average falls rapidly until the right k is reached, then changes little

[Figure: average distance to centroid plotted against k; the curve drops quickly and then flattens at the best value of k]

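As an illustration, the elbow idea can be sketched with scikit-learn's KMeans (assuming scikit-learn is installed; inertia_ is its total within-cluster sum of squared distances):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data with three well-separated blobs
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                  for c in [(0, 0), (5, 5), (0, 5)]])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    avg_sq_dist = km.inertia_ / len(data)   # average squared distance to centroid
    print(f"k={k}  avg squared distance to centroid = {avg_sq_dist:.3f}")
# The value drops sharply up to k = 3 and changes little afterwards,
# so the "elbow" suggests k = 3 for this data.
```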
Example: Selecting k

Too few: many long distances to centroid.

Example: Selecting k

Just right: distances rather short.

Example: Selecting k

Too many: little improvement in average distance.

Steps of the k-Means Algorithm
1. Choose the desired number of clusters k
2. Initialize the k cluster centers (centroids) at random
3. Assign every data object to the nearest cluster. Closeness between two objects is determined by their distance; the distance used in k-Means is the Euclidean distance (d):

d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

where x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) are two records with n attributes (columns)
4. Recompute the cluster centers using the current cluster memberships. The center of a cluster is the mean of all data objects in that cluster
5. Reassign every object using the new cluster centers. If the cluster centers no longer change, the clustering process is finished. Otherwise, go back to step 3 until the cluster centers no longer change (are stable) or there is no significant decrease in the SSE (Sum of Squared Errors)
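In practice these steps are usually run through a library; a hedged sketch using scikit-learn's KMeans (assuming scikit-learn is installed; the toy data is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Steps 1-2: choose k and let the library initialize the centroids
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=42)

# Steps 3-5: assignment and centroid updates are iterated internally
labels = km.fit_predict(X)

print("cluster labels :", labels)
print("cluster centers:", km.cluster_centers_)
print("SSE (inertia)  :", km.inertia_)
```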
An Example of K-Means Clustering

K = 2

[Figure: the initial data set is arbitrarily partitioned into k groups; the cluster centroids are then updated and objects reassigned, looping as needed until nothing changes]

 Partition objects into k nonempty subsets
 Repeat
 Compute the centroid (i.e., mean point) for each partition
 Assign each object to the cluster of its nearest centroid
 Until no change
Strength of K-Means
Strength:
1. Efficient: O(n), where n is the number of data points
2. Simple implementation
3. Guarantees convergence
4. Easily adapts to new data

(Google Developers)

Weaknesses of K-Means
Weaknesses:
1. Not good at clusters with different densities and sizes
2. k must be chosen manually
3. Sensitive to noisy data and outliers
4. Dependent on initial values
5. Not suitable for discovering clusters with non-convex shapes
6. Scales poorly with the number of dimensions

Weaknesses of K-Means
Clusters with Different Densities and Sizes

[Figure: natural clusters vs. the clusters produced by K-Means]

Weaknesses of K-Means
Clusters with Different Densities and Sizes

Solution: Use Generalized K-Means


Weaknesses of K-Means
Can’t handle non-convex clusters

[Figure: input data vs. K-Means output]

Weaknesses of K-Means
Can’t handle non-convex clusters

Solution: Combine with other methods, such as agglomerative clustering


Weaknesses of K-Means

Can’t handle non-convex clusters

Weaknesses of K-Means
The k-means algorithm is sensitive to outliers!
K-Medoids:
Instead of taking the mean value of the object in a cluster as a
reference point, medoids can be used, which is the most centrally
located object in a cluster

PAM: A Typical K-Medoids Algorithm
K = 2

[Figure: PAM on a small 2-D data set; the two configurations shown have total costs of 20 and 26]

Arbitrarily choose k objects as the initial medoids
Assign each remaining object to the nearest medoid
Randomly select a non-medoid object, O_random
Do loop, until no change:
Compute the total cost of swapping a medoid with O_random
Swap them if the quality is improved
The K-Medoid Clustering Method
K-Medoids Clustering: Find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
Starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
PAM works effectively for small data sets, but does not scale well for large
data sets (due to the computational complexity)
Efficiency improvement on PAM
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): Randomized re-sampling

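A simplified sketch of the PAM swap idea (NumPy; the function names are my own, and this naive version recomputes the full cost of every candidate swap, so it is only meant for small data sets):

```python
import numpy as np
from itertools import product

def total_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=-1)
    return d.min(axis=1).sum()

def pam(points, k, seed=0):
    points = np.asarray(points, float)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(points), size=k, replace=False))
    best = total_cost(points, medoids)
    improved = True
    while improved:                       # keep swapping while it helps
        improved = False
        for m_pos, o in product(range(k), range(len(points))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_pos] = o          # try one medoid / non-medoid swap
            cost = total_cost(points, candidate)
            if cost < best:
                medoids, best, improved = candidate, cost, True
    labels = np.linalg.norm(
        points[:, None, :] - points[medoids][None, :, :], axis=-1).argmin(axis=1)
    return medoids, labels, best

pts = np.array([[1, 1], [1.2, 0.8], [1, 2], [8, 8], [8.5, 8], [25, 25]])  # last point is an outlier
medoids, labels, cost = pam(pts, k=2)
print("medoid indices:", medoids, " labels:", labels, " total cost:", cost)
```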
Variations of the K-Means Method
Most variants of k-means differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means

Handling categorical data: k-modes
Replacing the means of clusters with modes
Using new dissimilarity measures to deal with categorical
objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype
method
2. HIERARCHICAL METHODS

Hierarchical Clustering
Use distance matrix as clustering criteria
This method does not require the number of clusters k as an
input, but needs a termination condition

[Figure: agglomerative clustering (AGNES) runs from step 0 to step 4, merging a, b, c, d, e into ab, de, cde, and finally abcde; divisive clustering (DIANA) runs the same steps in reverse, from step 4 back to step 0]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
Dendrogram: Shows How Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster

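As an illustration, a small sketch with SciPy (assuming it is available) that builds a single-link dendrogram and cuts it to obtain flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Build the merge tree (dendrogram) using single-link distances
Z = linkage(pts, method="single")

# "Cut" the dendrogram so that at most 3 clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # e.g., [1 1 2 2 3]

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib
```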
DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)


Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own


Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min { dist(tip, tjq) }

Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max { dist(tip, tjq) }

Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg { dist(tip, tjq) }

Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)

Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
Medoid: a chosen, centrally located object in the cluster
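A minimal sketch of the first three inter-cluster distances (NumPy; the function and variable names are my own):

```python
import numpy as np

def pairwise_distances(A, B):
    """Matrix of Euclidean distances between every point of A and every point of B."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):    return pairwise_distances(A, B).min()
def complete_link(A, B):  return pairwise_distances(A, B).max()
def average_link(A, B):   return pairwise_distances(A, B).mean()

K1 = [[0, 0], [0, 1]]
K2 = [[3, 0], [4, 0]]
print(single_link(K1, K2))    # 3.0   (closest pair)
print(complete_link(K1, K2))  # ~4.12 (farthest pair)
print(average_link(K1, K2))   # mean over all 4 pairs
```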
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Centroid: the “middle” of a cluster
C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

Radius: square root of the average distance from any point of the cluster to its centroid
R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}

Diameter: square root of the average mean squared distance between all pairs of points in the cluster
D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}
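A minimal sketch computing these three quantities for a small numeric cluster (NumPy; the function names are my own):

```python
import numpy as np

def centroid(cluster):
    """Mean point of the cluster."""
    return np.asarray(cluster, float).mean(axis=0)

def radius(cluster):
    """Square root of the average squared distance from the points to the centroid."""
    pts = np.asarray(cluster, float)
    c = pts.mean(axis=0)
    return np.sqrt(((pts - c) ** 2).sum(axis=1).mean())

def diameter(cluster):
    """Square root of the average squared distance between all pairs of distinct points."""
    pts = np.asarray(cluster, float)
    n = len(pts)
    sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(sq.sum() / (n * (n - 1)))

cluster = [[0, 0], [2, 0], [0, 2], [2, 2]]
print(centroid(cluster))   # [1. 1.]
print(radius(cluster))     # ~1.414 (sqrt(2))
print(diameter(cluster))   # ~2.31
```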
