Lecture 2
Clustering
Data
• raw measurements
symbols, signals, . . .
• corresponding to some attributes
height, grade, heartbeat, . . .
Attribute domain
• expresses the type of an attribute
number, string, sequence, . . .
• by the set D of admissible values
• called the domain of the attribute
height up to 3 m, grade from A to F , . . .
• and certain operations allowed on D
1 < 3, “A” ≥ “C”, “Jon” ≠ “John”, . . .
• w.l.o.g. D = ℝ
• d(x, y) = |x − y|
• Be aware of the range!
• normalization
Boolean attributes
• D = {0, 1}
• as nominal or ordinal
• Levenshtein distance
• Be aware of the range!
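The warnings about the range can be illustrated with a small Python sketch. It assumes min-max scaling as the normalization (the slides do not say which one is meant) and divides the Levenshtein distance by the longer string length so that it falls into [0, 1]. The function names are illustrative only.

```python
def min_max_normalize(values):
    """Rescale numbers linearly to [0, 1]; a constant attribute maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(s, t):
    """Levenshtein distance divided by the longer length, so it lies in [0, 1]."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


heights_cm = [150.0, 175.0, 200.0]
heights = min_max_normalize(heights_cm)
print(abs(heights[0] - heights[2]))  # 1.0 after scaling, instead of 50.0 in raw centimetres
print(levenshtein("Jon", "John"))    # 1
```

Without such scaling, an attribute measured in large units would dominate any distance that combines several attributes.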
Object
• A collection of recorded measurements (attributes) representing
an entity of observation (context, meaning)
e.g. a student represented by ID (nominal), age (quantitative), sex
(boolean), English proficiency (ordinal), list of completed courses
(set), yearly scores from IQ tests (time-series), . . .
• $x = (x_1, x_2, \ldots, x_m) \in D_1 \times D_2 \times \cdots \times D_m$
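Chord distance
• $d(x, y) = \left( 2 \left( 1 - \frac{\sum_{i=1}^{m} x_i y_i}{\left( \sum_{i=1}^{m} x_i^2 \sum_{i=1}^{m} y_i^2 \right)^{1/2}} \right) \right)^{1/2}$
• Be aware of the range!
• $\|x - y\|^2 = (x - y)^T (x - y) = \|x\|^2 + \|y\|^2 - 2 x^T y = 2(1 - \cos(x, y))$ if $\|x\| = \|y\| = 1$
Cosine similarity
• $s(x, y) = \frac{\sum_{i=1}^{m} x_i y_i}{\left( \sum_{i=1}^{m} x_i^2 \sum_{i=1}^{m} y_i^2 \right)^{1/2}}$
• Be aware of the range!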
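A minimal Python sketch of the two measures above, written with plain lists. It also checks numerically that the chord distance equals the Euclidean distance between the unit-normalized vectors. The helper `unit` and the sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def chord_distance(x, y):
    """sqrt(2 (1 - cos(x, y))); max() guards against tiny negative rounding errors."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine_similarity(x, y))))


def unit(x):
    """Scale x to Euclidean length 1."""
    n = math.sqrt(sum(a * a for a in x))
    return [a / n for a in x]


x, y = [3.0, 4.0], [4.0, 3.0]
ux, uy = unit(x), unit(y)
euclid_on_units = math.sqrt(sum((a - b) ** 2 for a, b in zip(ux, uy)))

print(cosine_similarity(x, y))
print(chord_distance(x, y), euclid_on_units)  # the two values agree up to rounding
```

As for the range: the cosine similarity lies in [−1, 1], so the chord distance lies in [0, 2].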
Given
• $D \subseteq D_1 \times D_2 \times \cdots \times D_m$
• a distance measure d (or similarity measure s)
• the number k of clusters
• k ≪ n
Pros
• Computationally efficient
• Quite often obtains good results, i.e. global optima
Cons
• The number of clusters k must be defined in advance
• Multiple runs with random initialization are recommended
• Can only find partitions with convex shape
• Cluster centers are influenced by outliers
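The pros and cons above read like those of a centroid-based method such as k-means. This is an assumption, since the algorithm is not named at this point. Under that assumption, the advice on multiple runs with random initialization can be sketched in Python as follows: run the algorithm several times and keep the run with the smallest sum of squared distances to the centers. All names are illustrative.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    m = len(points[0])
    return tuple(sum(p[j] for p in points) / len(points) for j in range(m))


def assign(data, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            for x in data]


def k_means(data, k, rng, max_iter=100):
    """One run of Lloyd-style k-means from k randomly chosen data points."""
    centers = rng.sample(data, k)
    for _ in range(max_iter):
        labels = assign(data, centers)
        new_centers = []
        for i in range(k):
            members = [x for x, l in zip(data, labels) if l == i]
            # an empty cluster keeps its previous center
            new_centers.append(mean(members) if members else centers[i])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(data, centers)
    sse = sum(sq_dist(x, centers[l]) for x, l in zip(data, labels))
    return labels, centers, sse


def best_of_restarts(data, k, restarts=10, seed=0):
    """Repeat k-means with different random starts; keep the lowest-SSE run."""
    rng = random.Random(seed)
    runs = [k_means(data, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, sse = best_of_restarts(data, k=2)
print(labels, sse)
```

Because each center is a mean, a single distant outlier can pull it away from the bulk of its cluster, which is the last drawback listed above.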
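Rand index
• $RI = \frac{a + d}{a + b + c + d}$
Jaccard index
• $J = \frac{a}{a + b + c}$
Precision
• $P = \frac{a}{a + b}$
Recall
• $R = \frac{a}{a + c}$
F-measure
• $F_\beta = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$
Could we use some measure from Information Theory?
e.g. $-\sum_{i=1}^{k} \sum_{x \in D,\, p(x) = i} \heartsuit_i \log \heartsuit_i \ldots$ ?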
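The counts a, b, c, d are not defined in the formulas above. The Python sketch below assumes the usual pair-counting reading, which is consistent with the precision and recall formulas: over all pairs of objects, a counts pairs placed together in both the clustering and the reference partition, b pairs together only in the clustering, c pairs together only in the reference, and d pairs separated in both. The function names and labels are illustrative.

```python
from itertools import combinations


def pair_counts(pred, truth):
    """Count object pairs by agreement between a clustering and a reference."""
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_truth = truth[i] == truth[j]
        if same_pred and same_truth:
            a += 1      # together in both
        elif same_pred:
            b += 1      # together only in the clustering
        elif same_truth:
            c += 1      # together only in the reference
        else:
            d += 1      # separated in both
    return a, b, c, d


def pair_indices(pred, truth, beta=1.0):
    """Rand, Jaccard, precision, recall and F-beta from the pair counts.
    Raises ZeroDivisionError in degenerate cases such as a + b = 0."""
    a, b, c, d = pair_counts(pred, truth)
    precision = a / (a + b)
    recall = a / (a + c)
    return {
        "rand": (a + d) / (a + b + c + d),
        "jaccard": a / (a + b + c),
        "precision": precision,
        "recall": recall,
        "f_beta": (beta ** 2 + 1) * precision * recall
                  / (beta ** 2 * precision + recall),
    }


pred = [0, 0, 0, 1, 1, 1]
truth = [0, 0, 1, 1, 1, 1]
print(pair_counts(pred, truth))  # (4, 2, 3, 6)
print(pair_indices(pred, truth))
```

Only pair agreement matters here, so the cluster ids themselves may be permuted without changing any of the indices.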
Silhouette
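• $S = \frac{1}{|D|} \sum_{x \in D} sil(x)$
• $sil(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}$
• $a(x) = \frac{\sum_{y \in D,\, p(y) = p(x)} d(x, y)}{\sum_{y \in D,\, p(y) = p(x)} 1}$
• $b(x) = \min_{i \in \{1, 2, \ldots, k\},\, i \neq p(x)} \left\{ \frac{\sum_{y \in D,\, p(y) = i} d(x, y)}{\sum_{y \in D,\, p(y) = i} 1} \right\}$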
• sil(x) ∈ [−1, 1]
• sil(x) = 1 ⇒ x is far away from the neighboring clusters
• sil(x) = 0 ⇒ x is on the boundary between two neighboring clusters
• sil(x) = −1 ⇒ x is probably assigned to the wrong cluster
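The silhouette formulas can be turned into a short Python sketch. It follows the definition of a(x) as written above, where the average runs over all members of the cluster of x, including x itself. The usual textbook definition excludes x. The Euclidean distance and the function names are assumptions, and at least two clusters are required.

```python
import math


def mean_distance(x, members, dist):
    """Average distance from x to every point in members."""
    return sum(dist(x, y) for y in members) / len(members)


def silhouette(data, labels, dist=math.dist):
    """Return (S, [sil(x) for each x]) for a labelled data set."""
    clusters = {}
    for x, label in zip(data, labels):
        clusters.setdefault(label, []).append(x)
    values = []
    for x, label in zip(data, labels):
        a = mean_distance(x, clusters[label], dist)  # own cluster, x included
        b = min(mean_distance(x, members, dist)      # nearest other cluster
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return sum(values) / len(values), values


data = [(0, 0), (0, 1), (5, 0), (5, 1)]
s_good, _ = silhouette(data, [0, 0, 1, 1])   # compact, well separated clusters
s_bad, _ = silhouette(data, [0, 1, 0, 1])    # each cluster spans both groups
print(s_good, s_bad)
```

A higher S indicates a better separated partition, so the first labelling scores higher than the second.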
• $noise = \{ p \in D : p \notin C_1 \cup \cdots \cup C_k \}$ where
• $C_1, \ldots, C_k \subseteq D$ are clusters
procedure DBSCAN(D, ε, δ)
    for all x ∈ D do
        p(x) ← −1                ▷ mark points as unclustered
    i ← 1                        ▷ the noise cluster has id 0
    for all x ∈ D do
        if p(x) = −1 then
            if ExpandCluster(D, x, i, ε, δ) then
                i ← i + 1
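The procedure above can be sketched in runnable Python. ExpandCluster is not spelled out here, so `expand_cluster` below follows the commonly described DBSCAN expansion step and is an assumption, as are the Euclidean distance, the reading of the two numeric parameters as a neighbourhood radius `eps` and a minimum neighbourhood size `min_pts`, and all function names.

```python
import math

UNCLUSTERED, NOISE = -1, 0


def region_query(data, idx, eps):
    """Indices of all points within distance eps of data[idx] (itself included)."""
    return [j for j in range(len(data)) if math.dist(data[idx], data[j]) <= eps]


def expand_cluster(data, labels, idx, cluster_id, eps, min_pts):
    """Grow a cluster from data[idx]; return False if it is not a core point."""
    neighbours = region_query(data, idx, eps)
    if len(neighbours) < min_pts:
        labels[idx] = NOISE          # may later become a border point
        return False
    labels[idx] = cluster_id
    queue = list(neighbours)
    while queue:
        j = queue.pop()
        if labels[j] == NOISE:
            labels[j] = cluster_id   # border point: reachable but not core
        if labels[j] != UNCLUSTERED:
            continue                 # already handled
        labels[j] = cluster_id
        reachable = region_query(data, j, eps)
        if len(reachable) >= min_pts:
            queue.extend(reachable)  # j is a core point, keep expanding
    return True


def dbscan(data, eps, min_pts):
    """Return a label per point: 0 for noise, 1, 2, ... for clusters."""
    labels = [UNCLUSTERED] * len(data)
    cluster_id = 1
    for idx in range(len(data)):
        if labels[idx] == UNCLUSTERED:
            if expand_cluster(data, labels, idx, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels


data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (5, 5)]
print(dbscan(data, eps=1.5, min_pts=3))  # [1, 1, 1, 1, 2, 2, 2, 2, 0]
```

The brute-force `region_query` scans all points on every call, so this sketch needs a quadratic number of distance computations. A spatial index would reduce that.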
Pros
• Clusters of an arbitrary shape
• Robust to outliers
Cons
• Computationally complex
• Hard to set the parameters
tomas.horvath@inf.elte.hu