
Exercises on

Clustering

Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU

4 February 2017

© 2016 Laurenz Wiskott (homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/. Figures from other sources have their own copyright, which is generally indicated. Do not distribute parts of these lecture notes showing figures with non-free copyrights (here usually figures I have the rights to publish but you don't, like my own published figures).

Several of my exercises (not necessarily on this topic) were inspired by papers and textbooks by other authors. Unfortunately, I did not document that well, because initially I did not intend to make the exercises publicly available, and now I cannot trace it back anymore. So I cannot give as much credit as I would like to. The concrete versions of the exercises are certainly my own work, though.

These exercises complement my corresponding lecture notes, available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/, where you can also find other teaching material such as programming exercises. The table of contents of the lecture notes is reproduced here to give an orientation as to when the exercises can reasonably be solved. For the best learning effect, I recommend that you first seriously try to solve the exercises yourself before looking into the solutions.

Contents

1 Introduction

2 Hard partitional clustering
   2.1 K-means algorithm
      2.1.1 Exercise: Stable assignment
   2.2 Davies-Bouldin index

3 Soft partitional clustering
   3.1 Gaussian mixture model
      3.1.1 Isotropic Gaussians
      3.1.2 Introduction
      3.1.3 Isotropic Gaussians
      3.1.4 Maximum likelihood estimation
      3.1.5 Conditions for a local optimum
      3.1.6 Exercise: Derive a condition for local optima for the mixture-of-Gaussians model
      3.1.7 Exercise: Derive a condition for local optima for the mixture-of-Gaussians model
      3.1.8 Exercise: Derive a condition for local optima for the mixture-of-Gaussians model
      3.1.9 EM algorithm
      3.1.10 Practical problems
      3.1.11 Anisotropic Gaussians +
   3.2 Partition coefficient index

4 Agglomerative hierarchical clustering
   4.1 Dendrograms
   4.2 The hierarchical clustering algorithm +
      4.2.1 Exercise: Hierarchical clustering with the single-link and the complete-link method
   4.3 Validating hierarchical clustering

5 Applications

1 Introduction

2 Hard partitional clustering

2.1 K-means algorithm

2.1.1 Exercise: Stable assignment

Given are input vectors $x$ and reference vectors $r_j$, which are updated according to the learning rule

    \Delta r_i = \varepsilon (x - r_i) ,
    \Delta r_{j \neq i} = 0 ,

with the input vectors being presented to the system completely randomly, i.e. with no particular constraints on their order. $\varepsilon$ is a constant learning rate with $0 < \varepsilon < 1$. Index $i$ indicates the winning unit, for which $|x - r_i| < |x - r_j|$ for all $j \neq i$ (we assume that $|x - r_i| = |x - r_{j \neq i}|$ does not happen).

Under which conditions does the assignment of input vectors to reference vectors not change even in the
long run? In other words, when is the assignment guaranteed to be stable?

Hint: You have to consider the possibility that the input vectors get presented to the system in an arbitrarily
unfavorable order.
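
For intuition, here is a minimal numerical sketch of the learning rule above; it is not part of the exercise and does not solve it. All names (X, refs, eps) and the toy data are my own choices, and plain NumPy is assumed.

import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(100, 2))      # toy input vectors x in 2D
refs = rng.normal(size=(3, 2))     # reference vectors r_j
eps = 0.1                          # constant learning rate, 0 < eps < 1

for _ in range(1000):
    x = X[rng.integers(len(X))]                       # present an input vector at random
    i = np.argmin(np.linalg.norm(x - refs, axis=1))   # winning unit i: closest reference vector
    refs[i] += eps * (x - refs[i])                    # Delta r_i = eps (x - r_i); all other r_j unchanged

# Assignment of each input vector to its closest reference vector after learning.
assignment = np.argmin(np.linalg.norm(X[:, None, :] - refs[None, :, :], axis=2), axis=1)
print(assignment)

Running such a sketch with different presentation orders of the inputs gives a feeling for when the final assignment does or does not keep changing.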

2.2 Davies-Bouldin index

3 Soft partitional clustering

3.1 Gaussian mixture model

3.1.1 Isotropic Gaussians

3.1.2 Introduction

3.1.3 Isotropic Gaussians

3.1.4 Maximum likelihood estimation

3.1.5 Conditions for a local optimum

3.1.6 Exercise: Derive a condition for local optima for the mixture-of-Gaussians model

The mixture-of-Gaussians model was defined in the lecture with isotropic Gaussians

    p(x|k) := \frac{1}{(2\pi\sigma_k^2)^{d/2}} \exp\left( -\frac{\|x - c_k\|^2}{2\sigma_k^2} \right) ,    (1)

and an overall probability density function

    p(x) = \sum_k p(x|k) P(k) .    (2)

The free parameters $c_k$, $\sigma_k^2$, and $P(k)$ had to be chosen such that the likelihood

    p(\{x_n\}) = \prod_n p(x_n)    (3)

is maximized. At a local optimum in parameter space, the gradient of the objective function (3) with respect to the free parameters must vanish. I have stated in the lecture three conditions that can be derived by setting the gradient to zero and that must be fulfilled at a local optimum. The task in this exercise is to work out the derivation, at least partially. Proceed as follows:

1. Define a new objective function $E$ by taking the negative logarithm of the likelihood (3), i.e. $E := -\ln(p(\{x_n\}))$.

   (a) Why might taking the logarithm be a useful transformation?
   (b) Should we minimize or maximize the new objective function?
   (c) Would that be truly equivalent to maximizing (3)?

2. Calculate the partial derivative of $E$ with respect to $\sigma_k$. Set the derivative to zero and derive a necessary condition for $\sigma_k^2$ to be at a local optimum.

Hint 1: When considering one type of parameter ($\sigma_k$, $c_k$, or $P(k)$), assume the others are already optimal.

Hint 2: Take advantage of Bayes' theorem $P(k|x_n) = \frac{p(x_n|k) P(k)}{p(x_n)}$.

Hint 3: Compare your results with the equations given in the lecture.
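
As a numerical companion to equations (1)-(3) and step 1, the following sketch evaluates the isotropic Gaussian components, the mixture density, and $E = -\ln p(\{x_n\})$ for a toy parameter set. It is not part of the exercise; the function and variable names as well as the toy parameters are my own assumptions, and only NumPy is used.

import numpy as np

def gauss_iso(X, c_k, sigma2_k):
    """Isotropic Gaussian p(x|k) of eq. (1), evaluated for all rows x of X."""
    d = X.shape[1]
    sq_dist = np.sum((X - c_k) ** 2, axis=1)
    return np.exp(-sq_dist / (2 * sigma2_k)) / (2 * np.pi * sigma2_k) ** (d / 2)

# Toy parameters (my own choice): two components in 2D.
C = np.array([[0.0, 0.0], [3.0, 3.0]])   # centers c_k
sigma2 = np.array([1.0, 0.5])            # variances sigma_k^2
P_k = np.array([0.4, 0.6])               # prior probabilities P(k)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))            # some data points x_n

# p(x_n) = sum_k p(x_n|k) P(k), eq. (2)
p_xn = sum(P_k[k] * gauss_iso(X, C[k], sigma2[k]) for k in range(len(P_k)))

# Negative log-likelihood E = -sum_n ln p(x_n), cf. eq. (3) and step 1
E = -np.sum(np.log(p_xn))
print(E)

One practical observation: multiplying the 200 small densities as in (3) quickly underflows in floating-point arithmetic, whereas the sum of logarithms stays well-behaved.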

3.1.7 Exercise: Derive a condition for local optima for the mixture-of-Gaussians model

The mixture-of-Gaussians model was defined in the lecture with isotropic Gaussians

    p(x|k) := \frac{1}{(2\pi\sigma_k^2)^{d/2}} \exp\left( -\frac{\|x - c_k\|^2}{2\sigma_k^2} \right) ,    (1)

and an overall probability density function

    p(x) = \sum_k p(x|k) P(k) .    (2)

The free parameters $c_k$, $\sigma_k^2$, and $P(k)$ had to be chosen such that the likelihood

    p(\{x_n\}) = \prod_n p(x_n)    (3)

is maximized. At a local optimum in parameter space, the gradient of the objective function (3) with respect to the free parameters must vanish. I have stated in the lecture three conditions that can be derived by setting the gradient to zero and that must be fulfilled at a local optimum. The task in this exercise is to work out the derivation, at least partially. Proceed as follows:

1. Define a new objective function $E$ by taking the negative logarithm of the likelihood (3), i.e. $E := -\ln(p(\{x_n\}))$.

   (a) Why might taking the logarithm be a useful transformation?
   (b) Should we minimize or maximize the new objective function?
   (c) Would that be truly equivalent to maximizing (3)?

2. The $P(k)$ are not really free parameters, because they have to fulfill the constraints

       P(k) \geq 0 ,    (4)
       \sum_k P(k) = 1 .    (5)

   Introduce auxiliary variables $\gamma_k$ and define

       P(k) := \frac{\exp(\gamma_k)}{\sum_l \exp(\gamma_l)} .    (6)

   Show that with this definition, the probabilities $P(k)$ assume valid values for any set of $\gamma_k$ with $-\infty < \gamma_k < +\infty$.

3. Calculate the partial derivative of $E$ with respect to $\gamma_k$. Set the derivative to zero and derive a necessary condition for $P(l)$ to be at a local optimum.

Hint 1: When considering one type of parameter ($\sigma_k$, $c_k$, or $P(k)$), assume the others are already optimal.

Hint 2: Take advantage of Bayes' theorem $P(k|x_n) = \frac{p(x_n|k) P(k)}{p(x_n)}$.

Hint 3: Compare your results with the equations given in the lecture.
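
As a quick check of step 2 (not required for the exercise), the following sketch verifies numerically that the parametrization (6) satisfies constraints (4) and (5) for arbitrary real auxiliary variables. The names softmax and gamma are my own; only NumPy is assumed.

import numpy as np

def softmax(gamma):
    """P(k) = exp(gamma_k) / sum_l exp(gamma_l), eq. (6)."""
    g = gamma - np.max(gamma)   # shift does not change the ratio but avoids overflow
    e = np.exp(g)
    return e / np.sum(e)

rng = np.random.default_rng(2)
for _ in range(5):
    gamma = rng.uniform(-50, 50, size=4)   # arbitrary real values
    P = softmax(gamma)
    assert np.all(P >= 0)                  # constraint (4)
    assert np.isclose(np.sum(P), 1.0)      # constraint (5)
    print(P)

Subtracting the maximum before exponentiating scales numerator and denominator by the same factor, so the probabilities are unchanged, while large positive values no longer overflow.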

3.1.8 Exercise: Derive a condition for local optima for the mixture-of-Gaussians model

The mixture-of-Gaussians model was defined in the lecture with isotropic Gaussians

    p(x|k) := \frac{1}{(2\pi\sigma_k^2)^{d/2}} \exp\left( -\frac{\|x - c_k\|^2}{2\sigma_k^2} \right) ,    (1)

and an overall probability density function

    p(x) = \sum_k p(x|k) P(k) .    (2)

The free parameters $c_k$, $\sigma_k^2$, and $P(k)$ had to be chosen such that the likelihood

    p(\{x_n\}) = \prod_n p(x_n)    (3)

is maximized. At a local optimum in parameter space, the gradient of the objective function (3) with respect to the free parameters must vanish. I have stated in the lecture three conditions that can be derived by setting the gradient to zero and that must be fulfilled at a local optimum. The task in this exercise is to work out the derivation, at least partially. Proceed as follows:

1. Define a new objective function $E$ by taking the negative logarithm of the likelihood (3), i.e. $E := -\ln(p(\{x_n\}))$.

   (a) Why might taking the logarithm be a useful transformation?
   (b) Should we minimize or maximize the new objective function?
   (c) Would that be truly equivalent to maximizing (3)?

2. Calculate the partial derivative of $E$ with respect to $c_k$, which is a gradient. Set the derivative to zero and derive a necessary condition for $c_k$ to be at a local optimum.

Hint 1: When considering one type of parameter ($\sigma_k$, $c_k$, or $P(k)$), assume the others are already optimal.

Hint 2: Take advantage of Bayes' theorem $P(k|x_n) = \frac{p(x_n|k) P(k)}{p(x_n)}$.
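
If you want to sanity-check the expression you derive for $\partial E / \partial c_k$, a finite-difference approximation of the gradient is a simple way to do it. The sketch below is my own construction, not part of the exercise; it repeats the toy setup from the sketch after exercise 3.1.6 so that it is self-contained, and all names are assumptions.

import numpy as np

def gauss_iso(X, c_k, sigma2_k):
    """Isotropic Gaussian p(x|k) of eq. (1)."""
    d = X.shape[1]
    sq = np.sum((X - c_k) ** 2, axis=1)
    return np.exp(-sq / (2 * sigma2_k)) / (2 * np.pi * sigma2_k) ** (d / 2)

def neg_log_likelihood(C, sigma2, P_k, X):
    """E = -sum_n ln sum_k p(x_n|k) P(k), cf. eqs. (1)-(3)."""
    p_xn = sum(P_k[k] * gauss_iso(X, C[k], sigma2[k]) for k in range(len(P_k)))
    return -np.sum(np.log(p_xn))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
C = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma2 = np.array([1.0, 0.5])
P_k = np.array([0.4, 0.6])

# Numerical gradient of E with respect to c_0 by central differences;
# compare it with the value of your analytically derived expression.
h = 1e-5
grad = np.zeros(2)
for a in range(2):
    Cp, Cm = C.copy(), C.copy()
    Cp[0, a] += h
    Cm[0, a] -= h
    grad[a] = (neg_log_likelihood(Cp, sigma2, P_k, X) - neg_log_likelihood(Cm, sigma2, P_k, X)) / (2 * h)
print(grad)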

3.1.9 EM algorithm

3.1.10 Practical problems

3.1.11 Anisotropic Gaussians +

3.2 Partition coefficient index

4 Agglomerative hierarchical clustering

4.1 Dendrograms

4.2 The hierarchical clustering algorithm +

4.2.1 Exercise: Hierarchical clustering with the single-link and the complete-link method

1. Define a set of four points $x_n$ in 1D for which the single-link and the complete-link method produce very different dendrograms. Draw the two corresponding dendrograms.

2. Define a set of four points $x_n$ in 1D for which the single-link and the complete-link method produce similar dendrograms. Draw the corresponding dendrogram.

A small code sketch for checking your dendrograms follows below.
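
The following sketch (my own addition, not part of the exercise) uses SciPy's agglomerative clustering to plot single-link and complete-link dendrograms side by side, so you can compare them with your hand-drawn ones. The four values in pts are placeholders; replace them with the point sets you define in parts 1 and 2. NumPy, SciPy, and Matplotlib are assumed.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Replace these four 1D points with the sets you define in parts 1 and 2.
pts = np.array([0.0, 1.0, 2.0, 10.0]).reshape(-1, 1)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, method in zip(axes, ("single", "complete")):
    Z = linkage(pts, method=method)   # single-link vs. complete-link merging
    dendrogram(Z, labels=[f"x{n + 1}" for n in range(len(pts))], ax=ax)
    ax.set_title(f"{method}-link")
plt.tight_layout()
plt.show()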

4.3 Validating hierarchical clustering

5 Applications
