
Tomáš Horváth

INTRODUCTION TO DATA SCIENCE

Lecture 2

Clustering

Data Science and Engineering Department


Faculty of Informatics
ELTE University
Basic Concepts
Data Types & Attributes

Data
• raw measurements
symbols, signals, . . .
• corresponding to some attributes
height, grade, heartbeat, . . .

Attribute domain
• expresses the type of an attribute
number, string, sequence, . . .
• by the set D of admissible values
• called the domain of the attribute
height up to 3 m, grade from A to F , . . .
• and certain operations allowed on D
1 < 3, “A” ≥ “C”, “Jon” ≠ “John”, . . .

Basic Concepts 1/34


What is clustering?

Given the data, the aim is to group objects (instances) into


so-called clusters, such that objects in the same cluster are (or, at
least, should be) more similar to each other than to the objects
belonging to other clusters

• Similarity plays an important role in clustering!

Basic Concepts 2/34


Similarity of Attribute Values
• s(x, y) ∈ [0, 1] for x, y ∈ D
• the complement of the dissimilarity (distance) d(x, y), i.e.
• s(x, y) = 1 − d(x, y)

Nominal attributes
• w.l.o.g. D = {1, 2, . . . , n}
• x, y ∈ D are symbols
• s(x, y) = 1 if x = y, 0 if x ≠ y

Ordinal attributes
• w.l.o.g. D = {1, 2, . . . , n}
• x, y ∈ D are ranks
• d(x, y) = |x − y| / (n − 1)
  D = {worst, bad, neutral, good, best}
  d(bad, good) = |2 − 4| / 4 = 0.5

Quantitative attributes
• w.l.o.g. D = R
• d(x, y) = |x − y|
• Be aware of the range!
  • normalization

Boolean attributes
• D = {0, 1}
• treated as nominal or ordinal
Basic Concepts 3/34


Similarity of Attribute Values
Set attributes
• w.l.o.g. D = P({1, 2, . . . , n}) \ {∅}
• s(x, y) = |x ∩ y| / |x ∪ y| (Jaccard index)
• s(x, y) = |x ∩ y| / min{|x|, |y|} (Overlap)

Sequence attributes (strings)
• w.l.o.g. D = {1, 2, . . . , n}^{<N}
• d(x, y) = d_{x,y}(|x|, |y|)
• d_{x,y}(i, j) =
  • max{i, j}, if min{i, j} = 0
  • min{ d_{x,y}(i − 1, j) + 1, d_{x,y}(i, j − 1) + 1, d_{x,y}(i − 1, j − 1) + 1_{x_i ≠ y_j} }, otherwise
• Levenshtein distance
• Be aware of the range!

Example: the matrix d_{x,y} for the strings Timi and Tom

           T   i   m   i
       0   1   2   3   4
   T   1   0   1   2   3
   o   2   1   1   2   3
   m   3   2   2   1   2

d(Timi, Tom) = 2
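
The recurrence can be implemented with a plain dynamic-programming table; a possible Python sketch (the function name is illustrative):

def levenshtein(x, y):
    """Edit distance between sequences x and y via the d_{x,y}(i, j) recurrence."""
    p, q = len(x), len(y)
    d = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(p + 1):
        d[i][0] = i                       # min{i, j} = 0 case: max{i, j}
    for j in range(q + 1):
        d[0][j] = j
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1   # indicator 1_{x_i != y_j}
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution / match
    return d[p][q]

print(levenshtein("Timi", "Tom"))   # -> 2, matching the table above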

Basic Concepts 4/34


Similarity of Attribute Values

• For longer strings, other similarity measures could be beneficial


• longest common substring or subsequence, . . .
• How would you compute the similarity of two texts?
We will talk about this later in the course. . .

Sequence attributes (time series)


• w.l.o.g. D = R^{<N}
• d(x, y) = d_{x,y}(|x|, |y|)
• d_{x,y}(i, j) =
  • 0, if i + j = 0
  • |x_i − y_j| + min{ d_{x,y}(i − 1, j), d_{x,y}(i, j − 1), d_{x,y}(i − 1, j − 1) }, if i · j > 0
  • ∞, otherwise

• Dynamic Time Warping distance
• Be aware of the range!

Basic Concepts 5/34


Similarity of Attribute Values

Illustration of Dynamic Time Warping

Basic Concepts 6/34


Similarity of Attribute Values
The Basic DTW Algorithm

1: procedure DTW(x = (x1, x2, . . . , xp), y = (y1, y2, . . . , yq))
2:     dx,y ← new (p + 1) × (q + 1) matrix                ▷ cost matrix dx,y
3:     for all i ∈ {1, 2, . . . , p} do
4:         dx,y(i, 0) ← ∞
5:     for all j ∈ {1, 2, . . . , q} do
6:         dx,y(0, j) ← ∞
7:     dx,y(0, 0) ← 0
8:     for i = 1 → p do
9:         for j = 1 → q do
10:            d ← |xi − yj|                              ▷ distance of xi and yj
11:            dx,y(i, j) ← d + min{dx,y(i − 1, j), dx,y(i, j − 1), dx,y(i − 1, j − 1)}
12:    return dx,y(p, q)
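
A rough Python translation of this pseudocode, assuming numeric sequences and an absolute-difference local cost:

import math

def dtw(x, y):
    """Dynamic Time Warping distance between two numeric sequences."""
    p, q = len(x), len(y)
    d = [[math.inf] * (q + 1) for _ in range(p + 1)]   # cost matrix, borders = inf
    d[0][0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = abs(x[i - 1] - y[j - 1])            # distance of x_i and y_j
            d[i][j] = cost + min(d[i - 1][j],
                                 d[i][j - 1],
                                 d[i - 1][j - 1])
    return d[p][q]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))   # -> 0.0: the warped series align exactly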

Basic Concepts 7/34


Objects, Records, Observations

Object
• A collection of recorded measurements (attributes) representing
an entity of observation (context, meaning)
e.g. a student represented by ID (nominal), age (quantitative), sex
(boolean), English proficiency (ordinal), list of completed courses
(set), yearly scores from IQ tests (time-series), . . .
• x = (x1 , x2 , . . . , xm ) ∈ D1 × D2 × · · · × Dm

• Objects with mixed types of attributes can be transformed


to objects having boolean or/and quantitative attribute types
• Be aware of the possible loss of information!
• Can you propose some approaches to such transformation?

Basic Concepts 8/34


Similarity of Binary Instances
Contingency table
• x = (x1, x2, . . . , xm), y = (y1, y2, . . . , ym), with xi, yi ∈ {0, 1}

                 x
              1       0      Sum
      1       a       b      a + b
  y
      0       c       d      c + d
    Sum     a + c   b + d      m

• a = Σ_{i=1}^{m} 1_{x_i = 1 = y_i}
• b = Σ_{i=1}^{m} 1_{x_i = 0, y_i = 1}
• c = Σ_{i=1}^{m} 1_{x_i = 1, y_i = 0}
• d = Σ_{i=1}^{m} 1_{x_i = 0 = y_i}

Treating a and d equally
• s(x, y) = (a + d) / m          (Simple matching)
• d(x, y) = b + c                (Hamming distance; equals the squared Euclidean distance for binary vectors)
• Be aware of the range!

Treating a and d unequally
• s(x, y) = (a + d/2) / m        (Faith’s similarity)

Ignoring d
• s(x, y) = a / (a + b + c)      (Jaccard index)

Example: x = (0, 1, 0, 1, 0, 1), y = (0, 1, 1, 1, 1, 0) ⇒ a = 2, b = 2, c = 1, d = 1
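
A short Python sketch that derives the pair counts and the three similarity measures above (illustrative helper, not part of the slides):

def binary_similarities(x, y):
    """Return simple matching, Faith's similarity and Jaccard index of two 0/1 vectors."""
    m = len(x)
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    c = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    d = m - a - b - c                        # both attributes zero
    return {"simple_matching": (a + d) / m,
            "faith": (a + d / 2) / m,
            "jaccard": a / (a + b + c)}

print(binary_similarities([0, 1, 0, 1, 0, 1], [0, 1, 1, 1, 1, 0]))
# a=2, b=2, c=1, d=1 -> simple matching 0.5, Faith ≈ 0.4167, Jaccard 0.4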


Basic Concepts 9/34
Similarity of Numerical Instances
Objects are points in an m-dimensional Euclidean space
Minkowski distance
• d(x, y) = ( Σ_{i=1}^{m} |x_i − y_i|^r )^{1/r}
• Manhattan distance (r = 1)
• Euclidean distance (r = 2)
• Be aware of the range!

Chord distance
• d(x, y) = ( 2 ( 1 − Σ_{i=1}^{m} x_i y_i / ( (Σ_{i=1}^{m} x_i²)^{1/2} (Σ_{i=1}^{m} y_i²)^{1/2} ) ) )^{1/2}
• Be aware of the range!

Cosine similarity
• s(x, y) = Σ_{i=1}^{m} x_i y_i / ( (Σ_{i=1}^{m} x_i²)^{1/2} (Σ_{i=1}^{m} y_i²)^{1/2} )
• Be aware of the range!

Note: ||x − y||² = (x − y)ᵀ(x − y) = ||x||² + ||y||² − 2xᵀy = 2(1 − cos(x, y)) if ||x||₂ = ||y||₂ = 1
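
A possible plain-Python sketch of these measures (no external libraries; function names are illustrative):

def minkowski(x, y, r=2):
    """Minkowski distance; r=1 gives Manhattan, r=2 Euclidean."""
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1 / r)

def cosine_similarity(x, y):
    """Cosine of the angle between two numeric vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = sum(xi * xi for xi in x) ** 0.5
    ny = sum(yi * yi for yi in y) ** 0.5
    return dot / (nx * ny)

def chord_distance(x, y):
    """Chord distance, i.e. sqrt(2 * (1 - cosine similarity))."""
    return (2 * (1 - cosine_similarity(x, y))) ** 0.5

x, y = (3.2, 178), (3.1, 170)
print(minkowski(x, y, r=1), minkowski(x, y, r=2), cosine_similarity(x, y))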

Basic Concepts 10/34


Similarity of Nominal, Ordinal and Mixed Instances
Nominal Instances
• s(x, y) = ( Σ_{i=1}^{m} 1_{x_i = y_i} ) / m

Ordinal Instances
• s(x, y) = Σ_{i=1}^{m−1} Σ_{j=i+1}^{m} o^x_{ij} o^y_{ij} / Σ_{i=1}^{m−1} Σ_{j=i+1}^{m} |o^x_{ij}| |o^y_{ij}|
• o^x_{ij} = 1 if x_i > x_j, −1 if x_i < x_j, 0 if x_i = x_j
• o^y_{ij} defined analogously to o^x_{ij}
• Goodman & Kruskal
• Be aware of the range!
  s(x = (1, 2, 3), y = (1, 2, 3)) = ((−1)·(−1) + (−1)·(−1) + (−1)·(−1)) / 3 = 3/3 = 1
  s(x = (1, 2, 3), y = (3, 2, 1)) = ((−1)·1 + (−1)·1 + (−1)·1) / 3 = −3/3 = −1

Mixed Instances
• s(x, y) = Σ_{i=1}^{m} w_i s(x_i, y_i) / Σ_{i=1}^{m} w_i
• w_i = 1 if x_i ≠ NA ≠ y_i, 0 otherwise
• s(x_i, y_i) is a suitable attribute similarity measure
• Gower’s index
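
One way Gower's index could look in Python (the per-attribute similarity functions and the None-as-missing-value convention are my own assumptions):

def gower(x, y, sims):
    """Gower's index: weighted average of per-attribute similarities.

    x, y  -- tuples of attribute values (None marks a missing value)
    sims  -- per-attribute similarity functions returning values in [0, 1]
    """
    num, den = 0.0, 0.0
    for xi, yi, sim in zip(x, y, sims):
        if xi is None or yi is None:     # w_i = 0 for missing values
            continue
        num += sim(xi, yi)               # w_i = 1 otherwise
        den += 1.0
    return num / den if den > 0 else 0.0

# one nominal and one quantitative attribute (range assumed to be 10)
sims = [lambda a, b: 1.0 if a == b else 0.0,
        lambda a, b: 1.0 - abs(a - b) / 10.0]
print(gower(("red", 3.0), ("red", 5.0), sims))   # -> (1.0 + 0.8) / 2 = 0.9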

Basic Concepts 11/34


Use-cases

related to clustering, grouping


• Ethnographers would like to create a hierarchy of villages in a
broader region such that villages strongly related by the similarity
of their folk heritage are grouped together at the lower levels.
• Marketers would like to divide a broad target market into smaller
subsets of customers with similar characteristics in order to
estimate their needs and interests.
• Biologists would like to know densely populated clusters of a
certain plant in the forest based on satellite images.

Basic Concepts 12/34


An old classic. . .
The Iris dataset
• Iris plants of the classes Setosa, Versicolour and Virginica
• 150 instances, 4 attributes
• sepal length and width in cm, petal length and width in cm

Basic Concepts 13/34


Hierarchical Agglomerative Clustering
Given
• D ⊆ D1 × D2 × · · · × Dm
• a distance measure d (or similarity measure s)
• linkage criterion
• the distance measure between A, B ⊂ D
• single linkage
• l(A, B) = min{d(a, b) | a ∈ A, b ∈ B}
• complete linkage
• l(A, B) = max{d(a, b) | a ∈ A, b ∈ B}
• average linkage
• l(A, B) = (1 / (|A| |B|)) Σ_{a∈A} Σ_{b∈B} d(a, b)

Clusters – Connection-based 14/34


Hierarchical Agglomerative Clustering

the goal is to find

• clusterings C1, C2, . . . , C|D| ⊂ P(D) \ {∅} of objects in D such that
• C1 = {{x1}, {x2}, . . . , {x|D|}}
  • initially, each object is in a separate cluster
• and for each i ∈ {2, . . . , |D|}
  • Ci = (Ci−1 \ {A*, B*}) ∪ {A* ∪ B*}
  • where A*, B* ∈ Ci−1 and l(A*, B*) = min{l(A, B) | A, B ∈ Ci−1, A ≠ B}

Thus, in each step i ∈ {2, . . . , |D|}

• |Ci| = |Ci−1| − 1
  • the two closest clusters are removed, merged and added as a new cluster
• each object is assigned to exactly one cluster
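
On the Iris data, this procedure corresponds roughly to the following SciPy sketch (assuming scipy and scikit-learn are installed; the 3-cluster cut mirrors the dendrogram tables on the next slides):

from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

X = load_iris().data                                    # 150 x 4 matrix of measurements
Z = linkage(X, method="single", metric="euclidean")     # agglomerative merges, single linkage
labels = fcluster(Z, t=3, criterion="maxclust")         # cut the dendrogram at 3 clusters
print(labels[:10])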

Clusters – Connection-based 15/34


Dendrograms
Single linkage

  Cluster   Set.  Vers.  Virg.
  Cut at 2 clusters
     1       50     0      0
     2        0    50     50
  Cut at 3 clusters
     1       50     0      0
     2        0    49     50
     3        0     1      0

Complete linkage

  Cluster   Set.  Vers.  Virg.
  Cut at 2 clusters
     1       50    29      0
     2        0    21     50
  Cut at 3 clusters
     1       50     0      0
     2        0    21     50
     3        0    29      0

Clusters – Connection-based 16/34


Dendrograms
Average linkage

  Cluster   Set.  Vers.  Virg.
  Cut at 2 clusters
     1       50     0      0
     2        0    50     50
  Cut at 3 clusters
     1       50     0      0
     2        0    45      1
     3        0     5     49

Pros of Aggl. Clustering
• easily interpretable
• setting of the parameters is not hard

Cons of Aggl. Clustering
• computationally complex
• subjective interpretation of dendrograms
• quite often obtains only local optima

Clusters – Connection-based 17/34


k-Means Clustering

Given
• D ⊆ D1 × D2 × · · · × Dm
• a distance measure d (or similarity measure s)
• the number k of clusters
• k ≪ n, where n = |D|

the goal is to find


• cluster centers c1 , c2 , . . . , ck
• and a mapping p : D → {1, 2, . . . , k} such that
  Σ_{i=1}^{n} d(x_i, c_{p(x_i)}) is minimal

Clusters – Prototype-based 18/34


k-Means Clustering
The algorithm

1. Initialize c1, c2, . . . , ck such that for all i ∈ {1, 2, . . . , k}
   • ci ∈ D (random initialization), or
   • ci = Σ_{x: p(x)=i} x / Σ_{x: p(x)=i} 1 for a random mapping p (random partition)

2. Compute p such that
   • Σ_{i=1}^{n} d(xi, c_{p(xi)}) is minimal, i.e. each object is assigned to its closest center

3. Update ci for all i ∈ {1, 2, . . . , k} such that
   • ci = Σ_{x: p(x)=i} x / Σ_{x: p(x)=i} 1

4. If p or ci for some i ∈ {1, 2, . . . , k} was changed, then go to step 2

5. Return p and c1, c2, . . . , ck
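
A compact NumPy sketch of these steps with random initialization and Euclidean distance (a didactic version, not a production implementation):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Returns the mapping p and the centers c_1, ..., c_k."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1: random initialization
    p = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_p = dists.argmin(axis=1)                             # step 2: closest center
        new_centers = np.array([X[new_p == i].mean(axis=0) if np.any(new_p == i)
                                else centers[i] for i in range(k)])   # step 3: update centers
        if np.array_equal(new_p, p) and np.allclose(new_centers, centers):
            break                                                # step 4: nothing changed
        p, centers = new_p, new_centers
    return p, centers                                            # step 5

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])
p, c = k_means(X, k=2)
print(p, c)   # the two pairs of nearby points end up in separate clusters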

Clusters – Prototype-based 19/34


Means vs. Medoids

Clusters – Prototype-based 20/34


k-Means “Good to know”

Pros
• Computationally efficient
• Quite often obtains good results, i.e. solutions close to the global optimum

Cons
• The necessity of defining k
• Multiple runs with random initialization recommended
• Can only find partitions with convex shape
• Influence of outliers to cluster centers

Clusters – Prototype-based 21/34


External Evaluation of Clusters

• Class labels of instances are known
  • e.g. Setosa, Versicolor, Virginica
• based on a contingency table over object pairs

                            in the same Class
                              Yes     No
  in the same     Yes          a       b
  Cluster         No           c       d

Rand index
• RI = (a + d) / (a + b + c + d)

Jaccard index
• J = a / (a + b + c)

Precision
• P = a / (a + b)

Recall
• R = a / (a + c)

F-measure
• F_β = ((β² + 1) · P · R) / (β² · P + R)

Could we use some measure from Information Theory?
  e.g. − Σ_{i=1}^{k} Σ_{x∈D, p(x)=i} ♥_i log ♥_i . . . ?
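
A small sketch of the pair-counting idea behind these indices (pure Python over all object pairs; the helper is illustrative):

from itertools import combinations

def pair_counts(cluster_labels, class_labels):
    """Count object pairs by 'same cluster?' x 'same class?' -> a, b, c, d."""
    a = b = c = d = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            a += 1
        elif same_cluster and not same_class:
            b += 1
        elif not same_cluster and same_class:
            c += 1
        else:
            d += 1
    return a, b, c, d

a, b, c, d = pair_counts([1, 1, 2, 2], ["x", "x", "x", "y"])
print((a + d) / (a + b + c + d))   # Rand index    -> 0.5
print(a / (a + b + c))             # Jaccard index -> 0.25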

Clusters – Evaluation 22/34


Internal Evaluation of Clusters

Silhouette
• S = (1 / |D|) Σ_{x∈D} sil(x)
• sil(x) = (b(x) − a(x)) / max{a(x), b(x)}
• a(x) = Σ_{y∈D, p(y)=p(x)} d(x, y) / Σ_{y∈D, p(y)=p(x)} 1
  • the average distance of x to the objects of its own cluster
• b(x) = min_{i∈{1,2,...,k}, i≠p(x)} { Σ_{y∈D, p(y)=i} d(x, y) / Σ_{y∈D, p(y)=i} 1 }
  • the lowest average distance of x to the objects of any other cluster
• sil(x) ∈ [−1, 1]
  • sil(x) = 1 ⇒ x is far away from the neighboring clusters
  • sil(x) = 0 ⇒ x is on the boundary between two neighboring clusters
  • sil(x) = −1 ⇒ x is probably assigned to the wrong cluster
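
In practice the mean silhouette can be obtained from scikit-learn's built-in scorer; a brief sketch, assuming the library is available:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # mean sil(x) over all objects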

Clusters – Evaluation 23/34


Internal Evaluation of Clusters

Clusters – Evaluation 24/34


Internal Evaluation of Clusters
Within-group sum of squares
• W = Σ_{i=1}^{k} Σ_{x∈D, p(x)=i} ||x − c_i||²
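
W is what the elbow method plots against k; a possible sketch, assuming scikit-learn (its inertia_ attribute corresponds to this sum for the fitted centers):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # W for this k; look for the "elbow" in the curve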

Clusters – Evaluation 25/34


Non-convex Clusters

Clusters – Density-based 26/34


Neighborhood and Reachability
• ε-neighborhood of p ∈ D defined as N_ε(p) = {x ∈ D | d(p, x) ≤ ε}
• p is directly density-reachable from q ∈ D w.r.t. some ε and δ if
  • p ∈ N_ε(q)
  • |N_ε(q)| ≥ δ, i.e. q is a core point

• p is density-reachable from q w.r.t. some ε and δ if
  • ∃ p1, . . . , pn ∈ D such that p1 = q, pn = p, and
  • p_{i+1} is directly density-reachable from p_i for 1 ≤ i < n

• p is density-connected to q w.r.t. some ε and δ if
  • ∃ o ∈ D such that both p and q are density-reachable from o

• C ⊆ D (C ≠ ∅) is a cluster w.r.t. some ε and δ if
  • ∀ p, q ∈ D: if p ∈ C and q is density-reachable from p then q ∈ C
  • ∀ p, q ∈ C: p is density-connected to q

• noise = {p ∈ D | p ∉ C1 ∪ · · · ∪ Ck} where
  • C1, . . . , Ck ⊆ D are the clusters

Clusters – Density-based 27/34


Neighborhood and Reachability

Clusters – Density-based 28/34


DBSCAN

1: procedure DBSCAN(D, ε, δ)
2:     for all x ∈ D do
3:         p(x) ← −1                       ▷ mark points as unclustered
4:     i ← 1                               ▷ the noise cluster has id 0
5:     for all p ∈ D do
6:         if p(p) = −1 then
7:             if ExpandCluster(D, p, i, ε, δ) then
8:                 i ← i + 1

Clusters – Density-based 29/34


DBSCAN
1: function ExpandCluster(D, p, i, ε, δ)
2:     if |N_ε(p)| < δ then
3:         p(p) ← 0                        ▷ mark p as noise
4:         return false
5:     else
6:         for all x ∈ N_ε(p) do
7:             p(x) ← i                    ▷ assign all x to cluster i
8:         S ← N_ε(p) \ {p}
9:         while S ≠ ∅ do
10:            s ← the first point of S
11:            if |N_ε(s)| ≥ δ then
12:                for all x ∈ N_ε(s) do
13:                    if p(x) ≤ 0 then    ▷ x is unclustered or noise
14:                        if p(x) = −1 then
15:                            S ← S ∪ {x}
16:                        p(x) ← i
17:            S ← S \ {s}
18:        return true
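
In practice one typically calls an existing implementation; a possible scikit-learn sketch (the library names the two parameters eps and min_samples):

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

X = load_iris().data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # eps ~ ε, min_samples ~ δ
print(set(labels))   # cluster ids; -1 marks noise points
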
Clusters – Density-based 30/34
How to guess ε and δ?
k-distance
• k-dist : D → R
• k-dist(x) is the distance of x to its k-th nearest neighbor
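
A common heuristic is to sort the k-distances and look for a knee in the curve; a brief sketch using scikit-learn's nearest-neighbour search:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data
k = 5                                                      # candidate value for δ
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])                             # k-dist(x) for every x (self excluded), ascending
print(k_dist[-10:])                                        # a knee in this curve suggests ε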

Clusters – Density-based 31/34


DBSCAN – “good to know”

Pros
• Clusters of an arbitrary shape
• Robust to outliers

Cons
• Computationally complex
• Hard to set the parameters

Clusters – Density-based 32/34


Final remarks

• domain knowledge might help in choosing the right similarity


measure
• be aware of the range of values of the attributes
• e.g. the similarity between x = (3.2, 178) and y = (3.1, 170) is
affected far more by the second coordinate

• there are various other approaches to similarity computation


• Janos Podani (2000). Introduction to the Exploration of Multivariate
Biological Data. Chapter 3: Distance, similarity, correlation...
Backhuys Publishers, Leiden, The Netherlands, ISBN 90-5782-067-6.

Clusters – Density-based 33/34


Thanks for your attention
References

• Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis (2001). On


Clustering Validation Techniques. Journal on Intelligent Information
Systems 17, 2-3.

• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (2005).


Introduction to Data Mining, (First Edition). Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA.

• Chris Ding and Xiaofeng He (2004). K-means clustering via principal


component analysis. In Proceedings of the twenty-first international
conference on Machine learning (ICML ’04). ACM, New York, NY, USA.

• Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996). A


density-based algorithm for discovering clusters in large spatial databases
with noise. Proceedings of the 2nd International Conference on
Knowledge Discovery and Data Mining, AAAI Press.

Clusters – Density-based 34/34


Homework

• Download a clustering dataset from the UCI Machine Learning


Repository

• Cluster the dataset using


• Agglomerative clustering
• k-means method
• DBSCAN method

• Justify the choice of the values for the hyper-parameters


• similarity, linkage, k, δ, , . . .

Clusters – Density-based 34/34


Questions?

tomas.horvath@inf.elte.hu
