
Clustering

What is Clustering?
• Attach a label to each observation (data point) in a set
• This is a form of "unsupervised learning"
• Clustering is also called "grouping"
• Intuitively, you would want to assign the same label to
data points that are "close" to each other
But, how many clusters do we have?
Example 1 · Example 2
[Figures: e.g., students in our school of engineering]
Application Example
Geochemical study of an impacted fluvial system
• Sediments collected along the Rapel Fluvial System in Central Chile
• Samples are analyzed using chemical and mineralogical analysis methods
• Feature vectors are built
• Clustering is applied
• Examples are shown in the map
Clustering depends on the selected features

• The number of clusters (i.e., the clustering result) depends on
which features are selected.

[Pipeline diagram:
Real World → Sensing —s→ Pre-processing —p→ Feature Extraction
(Computation) —x ∈ ℝⁿ→ Clustering —c→
n: number of features]
But, also on the distance metric

• Thus, after features are selected, clustering
algorithms rely on a distance metric between data points
(feature vectors)
• Sometimes it is said that, for clustering, the distance
metric is more important than the clustering algorithm
Distances: Quantitative Variables

Data point:
$x_i = [x_{i1} \ldots x_{ip}]^T$
Some examples:
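Two standard distance metrics for quantitative feature vectors can be sketched in plain Python (the function names are illustrative, not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two quantitative feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

xi = [0.0, 0.0]
xj = [3.0, 4.0]
print(euclidean(xi, xj))  # 5.0
print(manhattan(xi, xj))  # 7.0
```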
Distances: Ordinal and Categorical
Variables

• Ordinal variables can be mapped into (0, 1), and then
a quantitative metric can be applied:

$\frac{k - 1/2}{M}, \quad k = 1, 2, \ldots, M$

• For categorical variables, distances between each pair of
categories must be specified by the user.
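Both ideas can be sketched briefly; the category names and table values below are hypothetical, chosen only to illustrate a user-specified pairwise dissimilarity:

```python
def ordinal_to_unit(k, M):
    """Map ordinal level k (1..M) into (0, 1) via (k - 1/2) / M."""
    return (k - 0.5) / M

# M = 4 ordered levels, e.g. "low" < "medium" < "high" < "very high"
levels = [ordinal_to_unit(k, 4) for k in range(1, 5)]
# -> [0.125, 0.375, 0.625, 0.875]

# Categorical: a user-specified dissimilarity for each pair of categories
# (hypothetical table; a singleton key encodes the zero self-distance)
cat_dist = {
    frozenset(["red"]): 0.0,
    frozenset(["blue"]): 0.0,
    frozenset(["pink"]): 0.0,
    frozenset(["red", "blue"]): 1.0,
    frozenset(["red", "pink"]): 0.3,
    frozenset(["blue", "pink"]): 0.9,
}

def categorical_distance(a, b):
    """Symmetric lookup of the user-specified distance."""
    return cat_dist[frozenset([a, b])]
```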
But, in some cases using distances can be
tricky
K-means Overview

• "K" stands for the number of clusters; it is typically a user
input to the algorithm, although some criteria can be used to
automatically estimate K
• It is an approximation to an NP-hard combinatorial
optimization problem
• The K-means algorithm is iterative in nature
• It converges, but only to a local minimum
• Works only for numerical data
• Easy to implement
K-means: Setup

• x_1, …, x_N are data points or vectors of observations

• Each observation (vector x_i) will be assigned to one and only one cluster

• C(i) denotes the cluster number for the ith observation

• Dissimilarity measure: Euclidean distance metric

• K-means minimizes the within-cluster point scatter:

$W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(j)=k} \|x_i - x_j\|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - m_k\|^2$   (Exercise)

where

m_k is the mean vector of the kth cluster

N_k is the number of observations in the kth cluster
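The equality between the pairwise form and the centroid form of the within-cluster scatter (the "Exercise" above) can be checked numerically; a minimal sketch with plain Python lists:

```python
def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Component-wise mean vector of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def W_pairwise(clusters):
    """W(C) = 1/2 * sum_k sum_{C(i)=k} sum_{C(j)=k} ||x_i - x_j||^2."""
    return 0.5 * sum(sq_dist(xi, xj)
                     for pts in clusters for xi in pts for xj in pts)

def W_centroid(clusters):
    """W(C) = sum_k N_k * sum_{C(i)=k} ||x_i - m_k||^2."""
    total = 0.0
    for pts in clusters:
        mk = mean(pts)
        total += len(pts) * sum(sq_dist(xi, mk) for xi in pts)
    return total

# two toy clusters of 1-D points
clusters = [[[0.0], [2.0]], [[10.0], [12.0], [14.0]]]
assert W_pairwise(clusters) == W_centroid(clusters)  # both equal 28.0
```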


Within and Between Cluster Criteria
Let's consider the total point scatter for a set of N data points:

$T = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} d(x_i, x_j)$

where $d(x_i, x_j)$ is the distance between two points. T can be re-written as:

$T = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\left(\sum_{C(j)=k} d(x_i, x_j) + \sum_{C(j)\neq k} d(x_i, x_j)\right) = W(C) + B(C)$

where the within-cluster scatter is

$W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(j)=k} d(x_i, x_j)$

and the between-cluster scatter is

$B(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(j)\neq k} d(x_i, x_j)$

If d is the squared Euclidean distance, then

$W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - m_k\|^2$

and

$B(C) = \sum_{k=1}^{K} N_k \|m_k - \bar{m}\|^2$   (Ex.)

where $\bar{m}$ is the grand mean.

Since T is fixed for a given data set, minimizing W(C) is equivalent to maximizing B(C).
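Because every pair of points lies either within one cluster or between two clusters, T = W(C) + B(C) holds exactly under the pairwise definitions; a quick numerical check on a toy 1-D data set:

```python
def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [[0.0], [2.0], [10.0], [12.0]]
labels = [0, 0, 1, 1]  # cluster assignment C(i)
n = len(points)

# total, within-cluster, and between-cluster scatter (pairwise forms)
T = 0.5 * sum(sq_dist(points[i], points[j])
              for i in range(n) for j in range(n))
W = 0.5 * sum(sq_dist(points[i], points[j])
              for i in range(n) for j in range(n) if labels[i] == labels[j])
B = 0.5 * sum(sq_dist(points[i], points[j])
              for i in range(n) for j in range(n) if labels[i] != labels[j])

assert T == W + B  # every pair is counted in exactly one of W, B
```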
K-means Algorithm

• For a given cluster assignment C of the data points,
compute the cluster means m_k:

$m_k = \frac{\sum_{i:\,C(i)=k} x_i}{N_k}, \quad k = 1, \ldots, K$

• For the current set of cluster means, assign each
observation as:

$C(i) = \arg\min_{1 \le k \le K} \|x_i - m_k\|^2, \quad i = 1, \ldots, N$

• Iterate the above two steps until convergence
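The two-step iteration can be sketched in plain Python; initializing the means from the first K points is a demo choice here (random initialization is more common in practice):

```python
def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, K, max_iter=100):
    """Plain K-means: alternate assignment and mean-update steps."""
    means = [list(p) for p in points[:K]]  # deterministic init (demo choice)
    labels = None
    for _ in range(max_iter):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        new_labels = [min(range(K), key=lambda k: sq_dist(x, means[k]))
                      for x in points]
        if new_labels == labels:
            break  # assignments stable -> converged (a local minimum)
        labels = new_labels
        # Update step: m_k = mean of the points assigned to cluster k
        for k in range(K):
            members = [x for x, c in zip(points, labels) if c == k]
            if members:
                means[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means

# two well-separated blobs
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, means = kmeans(pts, 2)
```

On this data the algorithm recovers the two blobs in a couple of iterations; with a harder data set, restarting from several initializations and keeping the lowest W(C) is the usual remedy for local minima.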


K-means example 1
K-means example 2
Selecting the Number of Clusters

• For a given data distribution, K-means, and in general any
clustering algorithm, can find k clusters for any k (1 < k < N).
• It is not easy to choose k!
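One common heuristic (the "elbow" method, an assumption here rather than a method prescribed by these slides) runs K-means for several values of k and looks at where the within-cluster scatter stops dropping sharply:

```python
def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, K, max_iter=100):
    """Plain K-means with deterministic init (demo choice)."""
    means = [list(p) for p in points[:K]]
    labels = None
    for _ in range(max_iter):
        new = [min(range(K), key=lambda k: sq_dist(x, means[k]))
               for x in points]
        if new == labels:
            break
        labels = new
        for k in range(K):
            m = [x for x, c in zip(points, labels) if c == k]
            if m:
                means[k] = [sum(col) / len(m) for col in zip(*m)]
    return labels, means

def inertia(points, labels, means):
    """Sum of squared distances of each point to its assigned mean."""
    return sum(sq_dist(x, means[c]) for x, c in zip(points, labels))

pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
curve = []
for k in (1, 2, 3):
    labels, means = kmeans(pts, k)
    curve.append(inertia(pts, labels, means))
# the drop from k=1 to k=2 is large; from k=2 to k=3 it is tiny,
# so the "elbow" suggests k = 2 for this data
```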
Fuzzy c-means
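As a sketch of the idea behind fuzzy c-means: instead of a hard assignment C(i), each point x_i receives a membership degree u_ik in [0, 1] for every cluster (summing to 1 over clusters), and the algorithm alternates membership and center updates. The fuzzifier m = 2 and the deterministic initialization below are common demo choices, not prescribed by the slides:

```python
def dist(x, y):
    """Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def fcm(points, C, m=2.0, iters=50):
    """Fuzzy c-means: soft memberships instead of hard labels."""
    centers = [list(p) for p in points[:C]]  # deterministic init (demo)
    U = [[0.0] * C for _ in points]
    for _ in range(iters):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        for i, x in enumerate(points):
            d = [max(dist(x, c), 1e-12) for c in centers]  # avoid /0
            for k in range(C):
                U[i][k] = 1.0 / sum((d[k] / dj) ** (2.0 / (m - 1.0))
                                    for dj in d)
        # Center update: weighted mean with weights u_ik^m
        for k in range(C):
            w = [U[i][k] ** m for i in range(len(points))]
            tot = sum(w)
            centers[k] = [sum(w[i] * points[i][dim]
                              for i in range(len(points))) / tot
                          for dim in range(len(points[0]))]
    return U, centers

pts = [[0.0], [1.0], [10.0], [11.0]]
U, centers = fcm(pts, 2)
```

On this toy data the two centers settle near the two groups, and each point's largest membership identifies its natural cluster while still quantifying how ambiguous the assignment is.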
