
Tomáš Horváth

INTRODUCTION TO DATA SCIENCE

Lecture 2

Clustering

Data Science and Engineering Department


Faculty of Informatics
ELTE University
Basic Concepts
Data Types & Attributes

Data
• raw measurements
symbols, signals, . . .
• corresponding to some attributes
height, grade, heartbeat, . . .

Attribute domain
• expresses the type of an attribute
number, string, sequence, . . .
• by the set D of admissible values
• called the domain of the attribute
height up to 3 m, grade from A to F , . . .
• and certain operations allowed on D
1 < 3, “A” ≥ “C”, “Jon” ≠ “John”, . . .

Basic Concepts 1/34


What is clustering?

Given the data, the aim is to group objects (instances) into


so-called clusters, such that objects in the same cluster are (or, at
least, should be) more similar to each other than to the objects
belonging to other clusters

• Similarity plays an important role in clustering!

Basic Concepts 2/34


Similarity of Attribute Values
• s(x, y) ∈ [0, 1] for x, y ∈ D
• the complement of the dissimilarity (distance) d(x, y), i.e.
• s(x, y) = 1 − d(x, y)

Nominal attributes
• w.l.o.g. D = {1, 2, . . . , n}
• x, y ∈ D are symbols
• s(x, y) = 1 if x = y, 0 if x ≠ y

Ordinal attributes
• w.l.o.g. D = {1, 2, . . . , n}
• x, y ∈ D are ranks
• d(x, y) = |x − y| / (n − 1)
  D = {worst, bad, neutral, good, best}
  d(bad, good) = |2 − 4| / 4 = 0.5

Quantitative attributes
• w.l.o.g. D = R
• d(x, y) = |x − y|
• Be aware of the range!
  • normalization

Boolean attributes
• D = {0, 1}
• treated as nominal or ordinal
Basic Concepts 3/34


Similarity of Attribute Values
Set attributes
• w.l.o.g. D = P({1, 2, . . . , n}) \ {∅}
• s(x, y) = |x ∩ y| / |x ∪ y| (Jaccard index)
• s(x, y) = |x ∩ y| / min{|x|, |y|} (Overlap)

Sequence attributes (strings)
• w.l.o.g. D = {1, 2, . . . , n}^{<N}
• d(x, y) = d_{x,y}(|x|, |y|)
• d_{x,y}(i, j) =
  • max{i, j}, if min{i, j} = 0
  • min{ d_{x,y}(i − 1, j) + 1, d_{x,y}(i, j − 1) + 1, d_{x,y}(i − 1, j − 1) + 1_{x_i ≠ y_j} }, otherwise
• Levenshtein distance
• Be aware of the range!

Example: the matrix d_{x,y} for the strings Timi and Tom

           T   i   m   i
       0   1   2   3   4
   T   1   0   1   2   3
   o   2   1   1   2   3
   m   3   2   2   1   2

d(Timi, Tom) = 2
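
The recurrence can be implemented with a plain dynamic-programming table; a possible Python sketch (the function name is illustrative):

def levenshtein(x, y):
    """Edit distance between sequences x and y via the d_{x,y}(i, j) recurrence."""
    p, q = len(x), len(y)
    d = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(p + 1):
        d[i][0] = i                       # min{i, j} = 0 case: max{i, j}
    for j in range(q + 1):
        d[0][j] = j
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1   # indicator 1_{x_i != y_j}
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution / match
    return d[p][q]

print(levenshtein("Timi", "Tom"))   # -> 2, matching the table above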

Basic Concepts 4/34


Similarity of Attribute Values

• For longer strings, other similarity measures could be beneficial


• longest common substring or subsequence, . . .
• How would you compute the similarity of two texts?
We will talk about this later in the course. . .

Sequence attributes (time series)


• w.l.o.g. D = R^{<N}
• d(x, y) = d_{x,y}(|x|, |y|)
• d_{x,y}(i, j) =
  • 0, if i + j = 0
  • |x_i − y_j| + min{ d_{x,y}(i − 1, j), d_{x,y}(i, j − 1), d_{x,y}(i − 1, j − 1) }, if i · j > 0
  • ∞, otherwise

• Dynamic Time Warping distance
• Be aware of the range!

Basic Concepts 5/34


Similarity of Attribute Values

Illustration of Dynamic Time Warping

Basic Concepts 6/34


Similarity of Attribute Values
The Basic DTW Algorithm

1: procedure DTW(x = (x1, x2, . . . , xp), y = (y1, y2, . . . , yq))
2:     dx,y ← new (p + 1) × (q + 1) matrix                ▷ cost matrix dx,y
3:     for all i ∈ {1, 2, . . . , p} do
4:         dx,y(i, 0) ← ∞
5:     for all j ∈ {1, 2, . . . , q} do
6:         dx,y(0, j) ← ∞
7:     dx,y(0, 0) ← 0
8:     for i = 1 → p do
9:         for j = 1 → q do
10:            d ← |xi − yj|                              ▷ distance of xi and yj
11:            dx,y(i, j) ← d + min{dx,y(i − 1, j), dx,y(i, j − 1), dx,y(i − 1, j − 1)}
12:    return dx,y(p, q)
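
A rough Python translation of this pseudocode, assuming numeric sequences and an absolute-difference local cost:

import math

def dtw(x, y):
    """Dynamic Time Warping distance between two numeric sequences."""
    p, q = len(x), len(y)
    d = [[math.inf] * (q + 1) for _ in range(p + 1)]   # cost matrix, borders = inf
    d[0][0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = abs(x[i - 1] - y[j - 1])            # distance of x_i and y_j
            d[i][j] = cost + min(d[i - 1][j],
                                 d[i][j - 1],
                                 d[i - 1][j - 1])
    return d[p][q]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))   # -> 0.0: the warped series align exactly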

Basic Concepts 7/34


Objects, Records, Observations

Object
• A collection of recorded measurements (attributes) representing
an entity of observation (context, meaning)
e.g. a student represented by ID (nominal), age (quantitative), sex
(boolean), English proficiency (ordinal), list of completed courses
(set), yearly scores from IQ tests (time-series), . . .
• x = (x1 , x2 , . . . , xm ) ∈ D1 × D2 × · · · × Dm

• Objects with mixed types of attributes can be transformed


to objects having boolean or/and quantitative attribute types
• Be aware of the possible loss of information!
• Can you propose some approaches to such transformation?

Basic Concepts 8/34


Similarity of Binary Instances
Contingency table
• x = (x1, x2, . . . , xm), y = (y1, y2, . . . , ym), with xi, yi ∈ {0, 1}

                 x
              1       0      Sum
      1       a       b      a + b
  y
      0       c       d      c + d
    Sum     a + c   b + d      m

• a = Σ_{i=1}^{m} 1_{x_i = 1 = y_i}
• b = Σ_{i=1}^{m} 1_{x_i = 0, y_i = 1}
• c = Σ_{i=1}^{m} 1_{x_i = 1, y_i = 0}
• d = Σ_{i=1}^{m} 1_{x_i = 0 = y_i}

Treating a and d equally
• s(x, y) = (a + d) / m          (Simple matching)
• d(x, y) = b + c                (Hamming distance; equals the squared Euclidean distance for binary vectors)
• Be aware of the range!

Treating a and d unequally
• s(x, y) = (a + d/2) / m        (Faith’s similarity)

Ignoring d
• s(x, y) = a / (a + b + c)      (Jaccard index)

Example: x = (0, 1, 0, 1, 0, 1), y = (0, 1, 1, 1, 1, 0) ⇒ a = 2, b = 2, c = 1, d = 1
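
A short Python sketch that derives the pair counts and the three similarity measures above (illustrative helper, not part of the slides):

def binary_similarities(x, y):
    """Return simple matching, Faith's similarity and Jaccard index of two 0/1 vectors."""
    m = len(x)
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    c = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    d = m - a - b - c                        # both attributes zero
    return {"simple_matching": (a + d) / m,
            "faith": (a + d / 2) / m,
            "jaccard": a / (a + b + c)}

print(binary_similarities([0, 1, 0, 1, 0, 1], [0, 1, 1, 1, 1, 0]))
# a=2, b=2, c=1, d=1 -> simple matching 0.5, Faith ≈ 0.4167, Jaccard 0.4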


Basic Concepts 9/34
Similarity of Numerical Instances
Objects are points in an m-dimensional Euclidean space
Minkowski distance
• d(x, y) = ( Σ_{i=1}^{m} |x_i − y_i|^r )^{1/r}
• Manhattan distance (r = 1)
• Euclidean distance (r = 2)
• Be aware of the range!

Chord distance
• d(x, y) = ( 2 ( 1 − Σ_{i=1}^{m} x_i y_i / ( (Σ_{i=1}^{m} x_i²)^{1/2} (Σ_{i=1}^{m} y_i²)^{1/2} ) ) )^{1/2}
• Be aware of the range!

Cosine similarity
• s(x, y) = Σ_{i=1}^{m} x_i y_i / ( (Σ_{i=1}^{m} x_i²)^{1/2} (Σ_{i=1}^{m} y_i²)^{1/2} )
• Be aware of the range!

Note: ||x − y||² = (x − y)ᵀ(x − y) = ||x||² + ||y||² − 2xᵀy = 2(1 − cos(x, y)) if ||x||₂ = ||y||₂ = 1
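
A possible plain-Python sketch of these measures (no external libraries; function names are illustrative):

def minkowski(x, y, r=2):
    """Minkowski distance; r=1 gives Manhattan, r=2 Euclidean."""
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1 / r)

def cosine_similarity(x, y):
    """Cosine of the angle between two numeric vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = sum(xi * xi for xi in x) ** 0.5
    ny = sum(yi * yi for yi in y) ** 0.5
    return dot / (nx * ny)

def chord_distance(x, y):
    """Chord distance, i.e. sqrt(2 * (1 - cosine similarity))."""
    return (2 * (1 - cosine_similarity(x, y))) ** 0.5

x, y = (3.2, 178), (3.1, 170)
print(minkowski(x, y, r=1), minkowski(x, y, r=2), cosine_similarity(x, y))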

Basic Concepts 10/34


Similarity of Nominal, Ordinal and Mixed Instances
Nominal Instances
• s(x, y) = ( Σ_{i=1}^{m} 1_{x_i = y_i} ) / m

Ordinal Instances
• s(x, y) = Σ_{i=1}^{m−1} Σ_{j=i+1}^{m} o^x_{ij} o^y_{ij} / Σ_{i=1}^{m−1} Σ_{j=i+1}^{m} |o^x_{ij}| |o^y_{ij}|
• o^x_{ij} = 1 if x_i > x_j, −1 if x_i < x_j, 0 if x_i = x_j
• o^y_{ij} defined analogously to o^x_{ij}
• Goodman & Kruskal
• Be aware of the range!
  s(x = (1, 2, 3), y = (1, 2, 3)) = ((−1)·(−1) + (−1)·(−1) + (−1)·(−1)) / 3 = 3/3 = 1
  s(x = (1, 2, 3), y = (3, 2, 1)) = ((−1)·1 + (−1)·1 + (−1)·1) / 3 = −3/3 = −1

Mixed Instances
• s(x, y) = Σ_{i=1}^{m} w_i s(x_i, y_i) / Σ_{i=1}^{m} w_i
• w_i = 1 if x_i ≠ NA ≠ y_i, 0 otherwise
• s(x_i, y_i) is a suitable attribute similarity measure
• Gower’s index
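
One way Gower's index could look in Python (the per-attribute similarity functions and the None-as-missing-value convention are my own assumptions):

def gower(x, y, sims):
    """Gower's index: weighted average of per-attribute similarities.

    x, y  -- tuples of attribute values (None marks a missing value)
    sims  -- per-attribute similarity functions returning values in [0, 1]
    """
    num, den = 0.0, 0.0
    for xi, yi, sim in zip(x, y, sims):
        if xi is None or yi is None:     # w_i = 0 for missing values
            continue
        num += sim(xi, yi)               # w_i = 1 otherwise
        den += 1.0
    return num / den if den > 0 else 0.0

# one nominal and one quantitative attribute (range assumed to be 10)
sims = [lambda a, b: 1.0 if a == b else 0.0,
        lambda a, b: 1.0 - abs(a - b) / 10.0]
print(gower(("red", 3.0), ("red", 5.0), sims))   # -> (1.0 + 0.8) / 2 = 0.9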

Basic Concepts 11/34


Use-cases

related to clustering, grouping


• Ethnographers would like to create a hierarchy of villages in a
broader region such that villages strongly related by the similarity
of their folk heritage are grouped together at the lower levels.
• Marketers would like to divide a broad target market into smaller
subsets of customers with similar characteristics in order to
estimate their needs and interests.
• Biologists would like to know densely populated clusters of a
certain plant in the forest based on satellite images.

Basic Concepts 12/34


An old classic. . .
The Iris dataset
• Iris plants of the classes Setosa, Versicolour and Virginica
• 150 instances, 4 attributes
• sepal length and width in cm, petal length and width in cm

Basic Concepts 13/34


Hierarchical Agglomerative Clustering
Given
• D ⊆ D1 × D2 × · · · × Dm
• a distance measure d (or similarity measure s)
• linkage criterion
• the distance measure between A, B ⊂ D
• single linkage
• l(A, B) = min{d(a, b) | a ∈ A, b ∈ B}
• complete linkage
• l(A, B) = max{d(a, b) | a ∈ A, b ∈ B}
• average linkage
• l(A, B) = (1 / (|A| |B|)) Σ_{a∈A} Σ_{b∈B} d(a, b)

Clusters – Connection-based 14/34


Hierarchical Agglomerative Clustering

the goal is to find

• clusterings C1, C2, . . . , C|D| ⊂ P(D) \ {∅} of objects in D such that
• C1 = {{x1}, {x2}, . . . , {x|D|}}
  • initially, each object is in a separate cluster
• and for each i ∈ {2, . . . , |D|}
  • Ci = (Ci−1 \ {A*, B*}) ∪ {A* ∪ B*}
  • where A*, B* ∈ Ci−1 and l(A*, B*) = min{l(A, B) | A, B ∈ Ci−1, A ≠ B}

Thus, in each step i ∈ {2, . . . , |D|}

• |Ci| = |Ci−1| − 1
  • the two closest clusters are removed, merged and added as a new cluster
• each object is assigned to exactly one cluster
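
On the Iris data, this procedure corresponds roughly to the following SciPy sketch (assuming scipy and scikit-learn are installed; the 3-cluster cut mirrors the dendrogram tables on the next slides):

from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

X = load_iris().data                                    # 150 x 4 matrix of measurements
Z = linkage(X, method="single", metric="euclidean")     # agglomerative merges, single linkage
labels = fcluster(Z, t=3, criterion="maxclust")         # cut the dendrogram at 3 clusters
print(labels[:10])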

Clusters – Connection-based 15/34


Dendrograms
Single linkage

  Cluster   Set.  Vers.  Virg.
  Cut at 2 clusters
     1       50     0      0
     2        0    50     50
  Cut at 3 clusters
     1       50     0      0
     2        0    49     50
     3        0     1      0

Complete linkage

  Cluster   Set.  Vers.  Virg.
  Cut at 2 clusters
     1       50    29      0
     2        0    21     50
  Cut at 3 clusters
     1       50     0      0
     2        0    21     50
     3        0    29      0

Clusters – Connection-based 16/34


Dendrograms
Average linkage

  Cluster   Set.  Vers.  Virg.
  Cut at 2 clusters
     1       50     0      0
     2        0    50     50
  Cut at 3 clusters
     1       50     0      0
     2        0    45      1
     3        0     5     49

Pros of Aggl. Clustering
• easily interpretable
• setting of the parameters is not hard

Cons of Aggl. Clustering
• computationally complex
• subjective interpretation of dendrograms
• quite often obtains only local optima

Clusters – Connection-based 17/34


k-Means Clustering

Given
• D ⊆ D1 × D2 × · · · × Dm
• a distance measure d (or similarity measure s)
• the number k of clusters
• k ≪ n, where n = |D|

the goal is to find


• cluster centers c1 , c2 , . . . , ck
• and a mapping p : D → {1, 2, . . . , k} such that
  Σ_{i=1}^{n} d(x_i, c_{p(x_i)}) is minimal

Clusters – Prototype-based 18/34


k-Means Clustering
The algorithm

1. Initialize c1, c2, . . . , ck such that for all i ∈ {1, 2, . . . , k}
   • ci ∈ D (random initialization), or
   • ci = Σ_{x: p(x)=i} x / Σ_{x: p(x)=i} 1 for a random mapping p (random partition)

2. Compute p such that
   • Σ_{i=1}^{n} d(xi, c_{p(xi)}) is minimal, i.e. each object is assigned to its closest center

3. Update ci for all i ∈ {1, 2, . . . , k} such that
   • ci = Σ_{x: p(x)=i} x / Σ_{x: p(x)=i} 1

4. If p or ci for some i ∈ {1, 2, . . . , k} was changed, then go to step 2

5. Return p and c1, c2, . . . , ck
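
A compact NumPy sketch of these steps with random initialization and Euclidean distance (a didactic version, not a production implementation):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Returns the mapping p and the centers c_1, ..., c_k."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1: random initialization
    p = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_p = dists.argmin(axis=1)                             # step 2: closest center
        new_centers = np.array([X[new_p == i].mean(axis=0) if np.any(new_p == i)
                                else centers[i] for i in range(k)])   # step 3: update centers
        if np.array_equal(new_p, p) and np.allclose(new_centers, centers):
            break                                                # step 4: nothing changed
        p, centers = new_p, new_centers
    return p, centers                                            # step 5

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])
p, c = k_means(X, k=2)
print(p, c)   # the two pairs of nearby points end up in separate clusters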

Clusters – Prototype-based 19/34


Means vs. Medoids

Clusters – Prototype-based 20/34


k-Means “Good to know”

Pros
• Computationally efficient
• Quite often obtains good results, i.e. solutions close to the global optimum

Cons
• The necessity of defining k
• Multiple runs with random initialization recommended
• Can only find partitions with convex shape
• Influence of outliers to cluster centers

Clusters – Prototype-based 21/34


External Evaluation of Clusters

• Class labels of instances are known
  • e.g. Setosa, Versicolor, Virginica
• based on a contingency table over object pairs

                            in the same Class
                              Yes     No
  in the same     Yes          a       b
  Cluster         No           c       d

Rand index
• RI = (a + d) / (a + b + c + d)

Jaccard index
• J = a / (a + b + c)

Precision
• P = a / (a + b)

Recall
• R = a / (a + c)

F-measure
• F_β = ((β² + 1) · P · R) / (β² · P + R)

Could we use some measure from Information Theory?
  e.g. − Σ_{i=1}^{k} Σ_{x∈D, p(x)=i} ♥_i log ♥_i . . . ?
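
A small sketch of the pair-counting idea behind these indices (pure Python over all object pairs; the helper is illustrative):

from itertools import combinations

def pair_counts(cluster_labels, class_labels):
    """Count object pairs by 'same cluster?' x 'same class?' -> a, b, c, d."""
    a = b = c = d = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            a += 1
        elif same_cluster and not same_class:
            b += 1
        elif not same_cluster and same_class:
            c += 1
        else:
            d += 1
    return a, b, c, d

a, b, c, d = pair_counts([1, 1, 2, 2], ["x", "x", "x", "y"])
print((a + d) / (a + b + c + d))   # Rand index    -> 0.5
print(a / (a + b + c))             # Jaccard index -> 0.25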

Clusters – Evaluation 22/34


Internal Evaluation of Clusters

Silhouette
• S = (1 / |D|) Σ_{x∈D} sil(x)
• sil(x) = (b(x) − a(x)) / max{a(x), b(x)}
• a(x) = Σ_{y∈D, p(y)=p(x)} d(x, y) / Σ_{y∈D, p(y)=p(x)} 1
  • the average distance of x to the objects of its own cluster
• b(x) = min_{i∈{1,2,...,k}, i≠p(x)} { Σ_{y∈D, p(y)=i} d(x, y) / Σ_{y∈D, p(y)=i} 1 }
  • the lowest average distance of x to the objects of any other cluster
• sil(x) ∈ [−1, 1]
  • sil(x) = 1 ⇒ x is far away from the neighboring clusters
  • sil(x) = 0 ⇒ x is on the boundary between two neighboring clusters
  • sil(x) = −1 ⇒ x is probably assigned to the wrong cluster
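
In practice the mean silhouette can be obtained from scikit-learn's built-in scorer; a brief sketch, assuming the library is available:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # mean sil(x) over all objects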

Clusters – Evaluation 23/34


Internal Evaluation of Clusters

Clusters – Evaluation 24/34


Internal Evaluation of Clusters
Within-group sum of squares
• W = Σ_{i=1}^{k} Σ_{x∈D, p(x)=i} ||x − c_i||²
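
W is what the elbow method plots against k; a possible sketch, assuming scikit-learn (its inertia_ attribute corresponds to this sum for the fitted centers):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # W for this k; look for the "elbow" in the curve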

Clusters – Evaluation 25/34


Non-convex Clusters

Clusters – Density-based 26/34


Neighborhood and Reachability
• ε-neighborhood of p ∈ D defined as N_ε(p) = {x ∈ D | d(p, x) ≤ ε}
• p is directly density-reachable from q ∈ D w.r.t. some ε and δ if
  • p ∈ N_ε(q)
  • |N_ε(q)| ≥ δ, i.e. q is a core point

• p is density-reachable from q w.r.t. some ε and δ if
  • ∃ p1, . . . , pn ∈ D such that p1 = q, pn = p, and
  • p_{i+1} is directly density-reachable from p_i for 1 ≤ i < n

• p is density-connected to q w.r.t. some ε and δ if
  • ∃ o ∈ D such that both p and q are density-reachable from o

• C ⊆ D (C ≠ ∅) is a cluster w.r.t. some ε and δ if
  • ∀ p, q ∈ D: if p ∈ C and q is density-reachable from p then q ∈ C
  • ∀ p, q ∈ C: p is density-connected to q

• noise = {p ∈ D | p ∉ C1 ∪ · · · ∪ Ck} where
  • C1, . . . , Ck ⊆ D are the clusters

Clusters – Density-based 27/34


Neighborhood and Reachability

Clusters – Density-based 28/34


DBSCAN

1: procedure DBSCAN(D, ε, δ)
2:     for all x ∈ D do
3:         p(x) ← −1                       ▷ mark points as unclustered
4:     i ← 1                               ▷ the noise cluster has id 0
5:     for all p ∈ D do
6:         if p(p) = −1 then
7:             if ExpandCluster(D, p, i, ε, δ) then
8:                 i ← i + 1

Clusters – Density-based 29/34


DBSCAN
1: function ExpandCluster(D, p, i, ε, δ)
2:     if |N_ε(p)| < δ then
3:         p(p) ← 0                        ▷ mark p as noise
4:         return false
5:     else
6:         for all x ∈ N_ε(p) do
7:             p(x) ← i                    ▷ assign all x to cluster i
8:         S ← N_ε(p) \ {p}
9:         while S ≠ ∅ do
10:            s ← the first point of S
11:            if |N_ε(s)| ≥ δ then
12:                for all x ∈ N_ε(s) do
13:                    if p(x) ≤ 0 then    ▷ x is unclustered or noise
14:                        if p(x) = −1 then
15:                            S ← S ∪ {x}
16:                        p(x) ← i
17:            S ← S \ {s}
18:        return true
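
In practice one typically calls an existing implementation; a possible scikit-learn sketch (the library names the two parameters eps and min_samples):

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

X = load_iris().data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # eps ~ ε, min_samples ~ δ
print(set(labels))   # cluster ids; -1 marks noise points
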
Clusters – Density-based 30/34
How to guess ε and δ?
k-distance
• k-dist : D → R
• k-dist(x) is the distance of x to its k-th nearest neighbor
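
A common heuristic is to sort the k-distances and look for a knee in the curve; a brief sketch using scikit-learn's nearest-neighbour search:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data
k = 5                                                      # candidate value for δ
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])                             # k-dist(x) for every x (self excluded), ascending
print(k_dist[-10:])                                        # a knee in this curve suggests ε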

Clusters – Density-based 31/34


DBSCAN – “good to know”

Pros
• Clusters of an arbitrary shape
• Robust to outliers

Cons
• Computationally complex
• Hard to set the parameters

Clusters – Density-based 32/34


Final remarks

• domain knowledge might help in choosing the right similarity


measure
• be aware of the range of values of the attributes
• e.g. the similarity between x = (3.2, 178) and y = (3.1, 170) is
affected far more by the second coordinate

• there are various other approaches to similarity computation


• Janos Podani (2000). Introduction to the Exploration of Multivariate
Biological Data. Chapter 3: Distance, similarity, correlation...
Backhuys Publishers, Leiden, The Netherlands, ISBN 90-5782-067-6.

Clusters – Density-based 33/34


Thanks for your attention
References

• Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis (2001). On


Clustering Validation Techniques. Journal on Intelligent Information
Systems 17, 2-3.

• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (2005).


Introduction to Data Mining, (First Edition). Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA.

• Chris Ding and Xiaofeng He (2004). K-means clustering via principal


component analysis. In Proceedings of the twenty-first international
conference on Machine learning (ICML ’04). ACM, New York, NY, USA.

• Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996). A


density-based algorithm for discovering clusters in large spatial databases
with noise. Proceedings of the 2nd International Conference on
Knowledge Discovery and Data Mining, AAAI Press.

Clusters – Density-based 34/34


Homework

• Download a clustering dataset from the UCI Machine Learning


Repository

• Cluster the dataset using


• Agglomerative clustering
• k-means method
• DBSCAN method

• Justify the choice of the values for the hyper-parameters


• similarity, linkage, k, δ, , . . .

Clusters – Density-based 34/34


Questions?

tomas.horvath@inf.elte.hu
