
Prediction methods and Machine learning

Session VI

Pierre Michel
pierre.michel@univ-amu.fr

M2 EBDS

2021
Pretty parametrization and visualization of a Neural Network and applications in Econometrics

TensorFlow Playground
Let’s also check the work of Loann Desboulets (PhD Student in Econometrics, AMSE).

1. Complements on Unsupervised Learning

1.1 Supervised versus Unsupervised Learning

Supervised versus Unsupervised Learning

What are the differences?


• Supervised and Unsupervised Learning share many features in the algorithms they use to estimate a model.
• Supervised Learning: a target variable Y is available (regression: Y ∈ R, classification: Y ∈ {1, ..., K}).
• Unsupervised Learning (no target variable Y) concerns Density Estimation and Clustering.
Note: in many cases, density estimation solves the clustering problem.

Supervised learning
Training set: {(x(1), y(1)), (x(2), y(2)), (x(3), y(3)), ..., (x(m), y(m))}
[Figure: scatter plot of the labeled training set in the (x1, x2) plane]
Unsupervised learning
Training set: {x(1), x(2), x(3), ..., x(m)}
[Figure: the same scatter plot without labels, in the (x1, x2) plane]
1.2 Density estimation

Density estimators

Density estimation is the unsupervised version of regression.


The goal is to estimate the density of a random variable or random vector x, given n i.i.d. observations of this random variable.
• Parametric estimators: often consider a mixture of densities, assume a specific form for the density, and estimate the parameters using Maximum Likelihood or Expectation-Maximization.

f(x) = Σ_{k=1}^{K} αk fk(x)
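
A minimal sketch of such a parametric estimator in Python, fitted by EM with scikit-learn's GaussianMixture (the simulated data and the choice K = 2 are illustrative assumptions, not taken from the slides):

# Minimal sketch: parametric density estimation with a 2-component Gaussian mixture (EM).
import numpy as np
from sklearn.mixture import GaussianMixture

x = np.concatenate([np.random.normal(4, 1, 1000),
                    np.random.normal(6, 1, 1000)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2).fit(x)
print(gm.weights_)                       # estimated mixture weights alpha_k
print(gm.means_)                         # estimated component means
grid = np.linspace(0, 10, 100).reshape(-1, 1)
f_hat = np.exp(gm.score_samples(grid))   # estimated density f(x) on a grid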

Parametric estimator: univariate example (2 densities)

[Figure: two histograms of a simulated variable. Left: a single density; right: a mixture of two densities (x-axis: Variable, y-axis: Frequency)]

This problem can also be interpreted as a 2-cluster clustering task (see further).

Univariate example in Python: dealing with histograms


import numpy as np
from matplotlib import pyplot as plt

# common bins for all histograms
bins = np.linspace(0, 8, 100)

# two simulated Gaussian samples
x1 = np.random.normal(4, 1, 1000)
x2 = np.random.normal(6, 1, 1000)

# overlaid histograms of the two samples
plt.hist(x1, bins, alpha=0.5, label='x1')
plt.hist(x2, bins, alpha=0.5, label='x2')
plt.legend(loc='upper right')
plt.show()

# side by side: the pooled sample (mixture) and the two components
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.hist(np.append(x1, x2), bins, alpha=0.5, label='x')
ax2.hist(x1, bins, alpha=0.5, label='x1')
ax2.hist(x2, bins, alpha=0.5, label='x2')
plt.show()

Univariate example in Python: dealing with histograms


Density estimators
• Non-parametric estimators: using histograms or more generally
Kernel Density Estimators, select a kernel function κ and
consider the following estimator:

f(x) = (1 / (mh)) Σ_{i=1}^{m} κ( (x − x(i)) / h )

where h is the bandwidth parameter, which controls the width of the histogram breaks (or of the kernels).
You should vary the value of the breaks argument (or the bandwidth) to compare different estimators; see the Python sketch after this list.
• Ensemble methods: averaging, boosting, stacking (among others).
• Recent approaches: see the work of Mathias Bourel about boosting
approaches for recent examples. . .
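
A minimal sketch of a kernel density estimator in Python, assuming a Gaussian kernel and the simulated sample from the histogram example (the bandwidth values are illustrative):

# Minimal sketch: Gaussian kernel density estimates for two bandwidth settings.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

x = np.concatenate([np.random.normal(4, 1, 1000),
                    np.random.normal(6, 1, 1000)])
grid = np.linspace(0, 10, 200)

for h in [0.1, 0.5]:                     # bandwidth factors to compare
    kde = gaussian_kde(x, bw_method=h)   # kappa = Gaussian kernel
    plt.plot(grid, kde(grid), label=f"h = {h}")
plt.legend()
plt.show()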
1.3 Clustering

What is clustering?

Clustering aims to partition the data. The data represent a set of unlabeled observations.
We have seen that some supervised methods are also based on partitioning the feature space (CART and its extensions, SVM, linear discriminant analysis...).
Unsupervised partitioning (clustering) may be:
• hierarchical: AHC (agglomerative hierarchical clustering), CUBT. . .
• partitional non-hierarchical: K-means. . .
• density-based: mixture models, DBSCAN. . .
• soft/fuzzy: c-means, Latent Dirichlet Allocation (LDA, specific to
topic modelling). . .

Different applications of clustering

• Market segmentation
• Clinical medicine
• Social network analysis
• Cluster computing
• Astronomical data analysis
• Genetic data analysis
• ...

Clustering: which type of data can you cluster?

Any type of data can be clustered.


The data can be of different types:
• continuous (quantitative)
• ordinal (quantitative)
• nominal (qualitative)
• longitudinal (time-series)
• ...
Most widely used method: K-means (MacQueen’s algorithm), based on
a distance matrix adapted to the data you use.

Dissimilarity measures

Consider a set of m observations (x(1) , x(2) , ..., x(m) ), ∀i, x(i) ∈ Rn


A dissimilarity measure verifies the following properties, ∀i, j, k ∈
{1, ..., m}:
1. d(x(i) , x(i) ) = 0
2. d(x(i) , x(j) ) = d(x(j) , x(i) )
3. d(x(i) , x(j) ) = 0 =⇒ x(i) = x(j) (dissimilarity)
4. d(x(i) , x(j) ) ≤ d(x(i) , x(k) ) + d(x(k) , x(j) ) (distance)
5. d(x(i) , x(j) ) ≤ max(d(x(i) , x(k) ), d(x(k) , x(j) )) (ultrametric distance)

Choosing a dissimilarity measure

• Depends on the variables' type. Generally, we use the Euclidean distance, defined as follows:

d(x(i), x(j)) = sqrt( (x(i) − x(j))^T (x(i) − x(j)) )

• For quantitative variables, using the norm:

d(x(i), x(j)) = ||x(i) − x(j)||_2

Homogeneity criterion (inertia)


Consider m observations grouped in K clusters {C1, ..., CK}.
∀k ∈ {1, ..., K}, gk is the barycenter of Ck, mk is the number of observations in Ck and pk is the cluster weight. We define:
• Within-cluster inertia

Iw = Σ_{k=1}^{K} Ik = Σ_{k=1}^{K} Σ_{i=1}^{m} pk ||x(i) − gk||² 1{x(i) ∈ Ck}

• Between-cluster inertia

Ib = Σ_{k=1}^{K} pk ||gk − g||²

Note: g is the barycenter of the m observations.
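
A minimal NumPy sketch of these two quantities, assuming the usual uniform observation weight 1/m (so that the cluster weight is pk = mk/m); the data and the partition are illustrative:

# Minimal sketch: within- and between-cluster inertia of a given partition.
import numpy as np

X = np.random.rand(100, 2)               # m = 100 observations in R^2
labels = np.random.randint(0, 3, 100)    # an arbitrary partition into K = 3 clusters
m = len(X)
g = X.mean(axis=0)                       # global barycenter g

I_w, I_b = 0.0, 0.0
for k in np.unique(labels):
    Xk = X[labels == k]
    gk = Xk.mean(axis=0)                 # barycenter g_k of cluster k
    I_w += ((Xk - gk) ** 2).sum() / m    # weight 1/m per observation
    I_b += (len(Xk) / m) * ((gk - g) ** 2).sum()   # cluster weight m_k/m

I = ((X - g) ** 2).sum() / m             # total inertia
print(np.isclose(I, I_w + I_b))          # Huygens' theorem: I = Iw + Ib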


Huygens’ theorem

Total inertia I is the sum of within-cluster inertia and between-cluster inertia. Huygens' theorem thus states:

I = Iw + Ib

The number of clusters K̃ is traditionally chosen using the following homogeneity criterion:

K̃ = min_{K>0} Iw

Homogeneity criterion: interpretation

• a cluster is all the more homogeneous as its (within-cluster) inertia is low
• good clustering =⇒ 2 criteria: low Iw, high Ib
• these two criteria are equivalent according to Huygens' theorem

I = Iw + Ib

Scatterplot
[Figure: scatter plot of the example observations]

Barycenter (gravity center)


[Figure: the same scatter plot with the global barycenter g]

Total inertia
[Figure: illustration of total inertia on the scatter plot]

Total inertia = Within-cluster inertia + Between-cluster inertia

[Figure: illustration of the decomposition of total inertia on the scatter plot]



K-means algorithm

Input:
• K (number of clusters)
• Training set: x(1), x(2), x(3), ..., x(m)

One observation is denoted x(i) ∈ R^n

K-means algorithm
Randomly initialize K cluster centroids µ1, µ2, ..., µK ∈ R^n
Repeat
    for i = 1 to m (cluster assignment step)
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)

        c(i) = argmin_k ||x(i) − µk||²

    for k = 1 to K (update centroid step)
        µk := average (mean) of the points assigned to cluster k

        µk = (1 / #{i : c(i) = k}) Σ_{i : c(i) = k} x(i)

Cluster separability

[Figure: two scatter plots illustrating linear and non-linear cluster separability; the t-shirt sizing example plots Weight against Height]



Cluster separability

[Figure: the same two scatter plots with the resulting clusters; the t-shirt sizing panel shows three groups labeled S, M and L]



K-means algorithm: an optimization problem


• c(i) : index of cluster (1,2,. . . ,K) to which observation x(i) is
currently assigned
• µk : cluster centroid k (µk ∈ Rn )
• µc(i) : cluster centroid of cluster to which observation x(i) has been
assigned
Cost function:

J(c(1), ..., c(m), µ1, ..., µK) = (1/m) Σ_{i=1}^{m} ||x(i) − µ_{c(i)}||²
Minimization problem:

min_{c(1),...,c(m); µ1,...,µK} (1/m) Σ_{i=1}^{m} ||x(i) − µ_{c(i)}||²

K-means: random initialization and local optima

• We look for K < m
• Initialization: randomly pick K observations
• Set µ1, ..., µK equal to these K observations.
Issue: K-means can converge to local optima, so the returned partition may not be the same across different runs of the algorithm.
Solution: Repeated K-means

Repeated K-means

For i = 1 to 100:
    Randomly initialize K-means
    Run the K-means algorithm; get c(1), ..., c(m), µ1, ..., µK
    Compute the cost function J(c(1), ..., c(m), µ1, ..., µK)
Finally, pick the clustering that gave the lowest cost J(c(1), ..., c(m), µ1, ..., µK)
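
In practice this is what scikit-learn's KMeans does through its n_init parameter: the algorithm is rerun n_init times with different random initializations and the solution with the lowest cost (inertia) is kept. A minimal sketch (the data and parameter values are illustrative):

# Minimal sketch: repeated K-means via the n_init parameter.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)
km = KMeans(n_clusters=3, n_init=100, random_state=0).fit(X)
print(km.inertia_)   # lowest cost over the 100 runs (J up to the 1/m factor)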

K-means: the algorithm (MacQueen's)

• Choose random centers (K training examples randomly chosen).
• Repeat until the centers converge:
    ▸ Assign each observation x(i) to its closest center (in distance).
    ▸ Compute the new cluster centers (with the observations assigned at the previous step).
• Repeat the previous steps 10 times and keep the partition with the minimum within-cluster inertia (or sum of squares, or variance):

Iw = Σ_{i=1}^{m} Σ_{k=1}^{K} ||x(i) − µk||² 1{c(i) = k}

K-means: illustration

Example with the iris dataset


[Figure: scatter plot of the iris dataset, Sepal.Length vs. Sepal.Width]

K-means: illustration

[Figure: K-means partition (K = 3) of the iris dataset, Sepal.Length vs. Sepal.Width]

K-means: how to choose K?

[Figure: within-cluster inertia as a function of K, for K = 1 to 20 (elbow curve)]
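
A minimal sketch of this elbow curve with scikit-learn (the iris dataset is used only as an illustration):

# Minimal sketch: within-cluster inertia as a function of K (elbow curve).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
Ks = range(1, 21)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in Ks]

plt.plot(Ks, inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Within-cluster inertia')
plt.show()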

Scatterplot
[Figure: scatter plot of the example observations]

Forgy’s algorithm: Initialization of clusters (iteration 1)


[Figure: initial cluster centers chosen at random among the observations]

Forgy’s algorithm: Assign observations (iteration 1)


[Figure: each observation assigned to its closest center]

Forgy’s algorithm: Update centers (iteration 1)


[Figure: centers updated to the mean of their assigned observations]

Forgy’s algorithm: Assign observations (iteration 2)


[Figure: observations reassigned to the updated centers]

Forgy’s algorithm: Update centers (iteration 2)


[Figure: centers updated again]

Forgy’s algorithm: Assign observations (iteration 3)


[Figure: observations reassigned at iteration 3]

Extensions of K-means

• K-modes: better suited to qualitative data. The simple matching distance is often used in this case (variables with more than 2 levels should be converted to their binary representation). It uses modes rather than means as cluster centers.
• K-medians: recommended when dealing with ordinal data. It uses medians rather than means as cluster centers. A preferred distance here is the Manhattan distance.
• c-means: soft/fuzzy version of K-means in which an observation can be assigned to more than one cluster.

Remarks about K-means

• Pros
    ▸ The algorithm reduces within-cluster inertia at each step: it converges
    ▸ Few iterations needed
• Cons
    ▸ Unstable: the partition obtained depends on the initialization: run K-means several times...
    ▸ The number of clusters K is fixed by the user: simulations, principal component analysis...

K-means in Python

# import sklearn and numpy
from sklearn.cluster import KMeans
import numpy as np
# create some data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
# run K-means and fit to the data
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
# print cluster labels
print(kmeans.labels_)
# predict clusters for new data
kmeans.predict([[0, 0], [12, 3]])
# print centers coordinates
print(kmeans.cluster_centers_)

Hierarchical clustering

Hierarchical methods construct a dendrogram (a binary tree).

They are also based on a distance matrix (adapted to your data).
They can be agglomerative (“bottom-up”) or divisive (“top-down”).
We need to define a between-cluster distance (or agglomerative linkage strategy):
• single linkage
• complete linkage
• Ward’s method

Agglomerative hierarchical clustering

Goal: construct a set of partitions by successive cluster groupings

Output: not one partition, but a hierarchy of partitions, from m clusters to 1 cluster, reducing between-cluster inertia at each grouping

Agglomerative hierarchical clustering: algorithm

• Initialization: m clusters, corresponding to the m observations
• Repeat until only one cluster remains:
    ▸ Compute the distances between each pair of clusters
    ▸ Group the two nearest clusters into one
    ▸ Update the distances between each pair of clusters

Note: we need to define a dissimilarity measure between clusters. How to choose K?

Agglomerative hierarchical clustering: illustration

[Figure: step-by-step illustration on a small set of points in the (x, y) plane; the two closest observations/groups are merged at each step, forming groups 1 to 8 and building up the hierarchy]



Measuring dissimilarities between 2 clusters A and B

• Single linkage

∆(A, B) = min_{x(i) ∈ A, x(j) ∈ B} d(x(i), x(j))

• Complete linkage

∆(A, B) = max_{x(i) ∈ A, x(j) ∈ B} d(x(i), x(j))

• Ward's method

∆(A, B) = (pA pB / (pA + pB)) d²(gA, gB)

Measuring dissimilarities between 2 clusters A and B: illustration

[Figure: three panels illustrating single linkage, complete linkage and Ward's method between two clusters of points]


Dendrogram: example with the iris dataset

[Figure: cluster dendrogram of the 150 iris observations (y-axis: Height, 0 to 1000)]

Agglomerative hierarchical clustering in Python


from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
# create some data (8 observations, 1 feature)
X = [[i] for i in [2, 8, 0, 4, 1, 9, 9, 0]]
# AHC with Ward's method
Z = linkage(X, 'ward')
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)
# AHC with single linkage
Z = linkage(X, 'single')
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)
# plot
plt.show()

Note: X can be a distance matrix.



What about distance matrices... in Python?


import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

# two small 2-D point clouds
x1 = np.random.normal(4, 1, 4)
x2 = np.random.normal(6, 1, 4)
y1 = np.random.normal(4, 1, 4)
y2 = np.random.normal(6, 1, 4)

plt.scatter(x1, y1)
plt.scatter(x2, y2)
plt.show()

# stack the 8 points into an 8x2 array and compute pairwise Euclidean distances
x = np.append(x1, x2)
y = np.append(y1, y2)
dat = np.c_[x, y]
dist = pdist(dat, metric="euclidean")   # condensed distance matrix
print(squareform(dist).shape)           # square form: (8, 8)

1.4 Data simulation and performance evaluation

Ordinal and nominal data simulation models

By definition, in unsupervised learning, we do not have access to a vector of target values y. To overcome this issue and be able to compute a "proxy" error rate, we can use simulated data.
In the following, we propose some simulation models for qualitative data.
We define some parameters:
• K the number of clusters
• n the number of variables
• m ∈ {100, 300, 500, 1000} the number of observations
• lj = l the number of levels
• m/K the number of observations per group

Model 1: a simple model

We set K = 3, n = 9, l = 5. Each cluster is characterized by a high frequency of one level.
In this example, levels 1, 3 and 5 are the most frequent for clusters 1, 2 and 3 respectively.
The other levels are uniformly distributed. For example, the distribution of each variable xj in cluster 1 is defined as follows:

P(xj = 1) = q
P(xj = x) = (1 − q) / (l − 1)   ∀x ≠ 1

The same is done for the other clusters. A good choice for q would be 0.8 (high frequency).
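
A minimal NumPy sketch of Model 1 (the function and variable names are mine, not from the slides):

# Minimal sketch of Model 1: K = 3 clusters, n = 9 nominal variables, l = 5 levels;
# in each cluster one level has probability q, the others share the rest uniformly.
import numpy as np

def simulate_model1(m=300, K=3, n=9, l=5, q=0.8, seed=0):
    rng = np.random.default_rng(seed)
    dominant = [1, 3, 5]                      # dominant level of each cluster
    X, y = [], []
    for c in range(K):
        probs = np.full(l, (1 - q) / (l - 1))
        probs[dominant[c] - 1] = q            # levels are coded 1..l
        Xc = rng.choice(np.arange(1, l + 1), size=(m // K, n), p=probs)
        X.append(Xc)
        y += [c] * (m // K)
    return np.vstack(X), np.array(y)

X, y = simulate_model1()
print(X.shape, np.bincount(y))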

Model 2: IRT-based simulation


We set K = 3. We propose an approach based on Item Response Theory (IRT).
We herein use the Generalized Partial Credit Model (GPCM); the probability that an individual i gives the response x to item j of a questionnaire is:

pjx(θ) = P(xj(i) = x | θ) = exp( Σ_{k=1}^{x} αj(θi − βjk) ) / Σ_{r=1}^{l} exp( Σ_{k=0}^{r} αj(θi − βjk) )

• θ is the individual parameter (also called latent trait or ability)
• βjk is the threshold parameter for the k-th level of item j
• αj is the discrimination parameter of item j
For each class c ∈ {1, ..., K} we simulate a vector of m values (θi)_{i=1,...,m} distributed as N(µc, σ²) with µ1 = −3, µ2 = 0, µ3 = 3 and σ² = 0.1.
For each item j, we set αj = 1.2 and βj = (−1, −1/3, 1/3, 1).
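
A minimal NumPy sketch of Model 2, under the common convention that response categories are coded 0, ..., l−1 and that the k = 0 term of the cumulative sum is zero (this convention, the number of items n = 9, and all names are my assumptions):

# Minimal sketch of Model 2: GPCM responses for K = 3 latent classes.
import numpy as np

def gpcm_probs(theta, alpha, beta):
    # cumulative scores s_x = sum_{k<=x} alpha * (theta - beta_k), with s_0 = 0
    scores = np.concatenate(([0.0], np.cumsum(alpha * (theta - beta))))
    expo = np.exp(scores - scores.max())      # softmax, numerically stable
    return expo / expo.sum()

def simulate_model2(m=300, n=9, K=3, alpha=1.2, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.array([-1, -1/3, 1/3, 1])       # 4 thresholds -> 5 levels
    mus = [-3, 0, 3]                          # one latent mean per class
    X, y = [], []
    for c in range(K):
        thetas = rng.normal(mus[c], np.sqrt(0.1), m // K)
        for th in thetas:
            X.append([rng.choice(len(beta) + 1, p=gpcm_probs(th, alpha, beta))
                      for _ in range(n)])
            y.append(c)
    return np.array(X), np.array(y)

X, y = simulate_model2()
print(X.shape)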


Model 3: Tree-based model

K = 4, p = 3, l = 6. Each level is encoded as an integer, and we distinguish odd and even levels. Clusters are defined as follows:
• C1 : x1 and x2 have odd levels, x3 is random
• C2 : x1 has odd levels, x2 has even levels, x3 is random
• C3 : x1 has even levels, x3 has odd levels, x2 is random
• C4 : x1 and x3 have even levels, x2 is random

Figure 2: Tree structure used for model 3

Model 4: Another tree-based model

K = 4, p = 3, l = 4.
The only difference with previous model is that the levels are not uniformly
distributed in each cluster.
Let’s consider a parameter p0 controlling the non-uniformity of the level distribution; for example, set p0 = 0.8 and define the clusters as follows:
• C1 : x1 and x2 have odd levels with P (x1 = 1) = P (x2 = 1) = p0 ,
x3 is random
• C2 : x1 has odd levels, x2 has even levels with
P (x1 = 1) = P (x2 = 2) = p0 , x3 is random
• C3 : x1 has even levels, x3 has odd levels, with
P (x1 = 2) = P (x3 = 1) = p0 , x2 is random
• C4 : x1 and x3 have even levels with P (x1 = 2) = P (x3 = 2) = p0 ,
x2 is random

Performance criterion: Category utility

Consider a partition C = {Ck}_{k=1,...,K}, found by a clustering algorithm based on given features (variables) fj, j = 1, ..., n. The features are assumed to be nominal, so that each value that fj can take has the form vjl.
The category utility function scores a partition C given a set of features, according to the formula:

 
CU(C) = (1/K) Σ_{k=1}^{K} P(Ck) [ Σ_j Σ_l P(fj = vjl | Ck)² − Σ_j Σ_l P(fj = vjl)² ]

This criterion (not based on inertia) is useful to check the quality of a partition obtained with a clustering method on qualitative data.
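
A minimal sketch of this criterion for a matrix of nominal data (the function name, the data and the partition are illustrative):

# Minimal sketch: category utility of a partition of nominal data.
import numpy as np

def category_utility(X, labels):
    m, n = X.shape
    clusters, counts = np.unique(labels, return_counts=True)
    cu = 0.0
    for k, mk in zip(clusters, counts):
        Xk = X[labels == k]
        inner = 0.0
        for j in range(n):
            # sum over levels of P(fj = v | Ck)^2 - P(fj = v)^2
            for v in np.unique(X[:, j]):
                inner += np.mean(Xk[:, j] == v) ** 2 - np.mean(X[:, j] == v) ** 2
        cu += (mk / m) * inner
    return cu / len(clusters)

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(300, 9))     # illustrative nominal data, levels 1..5
labels = rng.integers(0, 3, size=300)     # an arbitrary partition into 3 clusters
print(category_utility(X, labels))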

Another performance criterion: Misclassification Error

The Misclassification Error (ME) rate can be used as follows:


Let y1 , y2 , ..., ym be the class labels of each observation (in practice, you
do not have access to y, so data simulation is required).
Let ŷ1, ŷ2, ..., ŷm be the "predicted" labels assigned by a clustering algorithm.
We denote Σ the set of all possible permutations of the predicted labels.
The ME rate, also called “matching error” is defined as follows:

ME = min_{σ∈Σ} (1/m) Σ_{i=1}^{m} 1{yi ≠ σ(ŷi)}

This empirically solves the label-switching problem, typical of clustering tasks.
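
A minimal sketch of this criterion; for a small number of clusters, the minimum over permutations can be computed by brute force (for larger K, an assignment solver such as scipy.optimize.linear_sum_assignment is the usual choice):

# Minimal sketch: misclassification (matching) error between true labels y
# and clustering labels y_hat, minimized over permutations of the labels.
from itertools import permutations
import numpy as np

def matching_error(y, y_hat):
    labels = np.unique(y_hat)
    best = 1.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        err = np.mean(y != np.array([mapping[c] for c in y_hat]))
        best = min(best, err)
    return best

y     = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])   # same clusters, switched labels
print(matching_error(y, y_hat))                 # 1/9 after the best relabelling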

2. Recent approaches in clustering

2.1 Density-based spatial clustering of applications with noise (DBSCAN)

DBSCAN

DBSCAN is a density-based, non-hierarchical clustering method that uses only two parameters:
• ε, which is a reachability distance (a radius)
• MinPts, the minimum number of points (or training examples) required to form a cluster
These two parameters tune the method and have to be fixed by the user.
The goal is to separate high-density regions (determined by core points) from low-density regions (determined by noise points) in the feature space.

DBSCAN: illustration

Figure 3: Illustration of DBSCAN (Wikipedia). Red points represent a high-density region, the blue point represents a low-density region, and yellow points represent the "frontiers" of their cluster.

DBSCAN: the algorithm

• Choose a random point (observation) x0(i) from the data.
• Construct the ε-neighborhood of this point (the set of points that are at distance less than ε from x0(i)).
    ▸ if there are at least MinPts points in the ε-neighborhood, then it will form a cluster (high-density region)
    ▸ otherwise, the points in the ε-neighborhood will be considered as noise (low-density region)

All dense points found in an ε-neighborhood are added to the cluster.
Once no dense point is found any more (we then talk about border points), another random point is chosen and the process is repeated to explore new clusters.

DBSCAN in Python

from sklearn.cluster import DBSCAN
import numpy as np
# create some data 6x2
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
# fit DBSCAN
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
# cluster labels (label -1 means noise)
print(clustering.labels_)
# model info
print(clustering)

2.2 Clustering using binary trees

Clustering using unsupervised binary trees

CUBT is a top-down hierarchical clustering method that works in 3 steps:
• growing the maximal tree: recursive binary partitioning
• pruning the tree (dissimilarity-based pruning)
• joining the leaves of the tree (alternative pruning)

Similarities with CART

CUBT has many similarities with CART:


• Efficiency
• Flexibility
• Interpretability
• Good convergence properties

Step 1: Growing the maximal tree

Let t be a tree node containing a set of observations in R^p. The child nodes of t are denoted tL and tR, defined as follows:

tL = {x ∈ R^p | xj ≤ a}   and   tR = {x ∈ R^p | xj > a}

Let Xt = {x | x ∈ t}, αt = P(x ∈ t) and R(t) a heterogeneity measure (deviance) of t, defined as:

R(t) = αt tr(cov(Xt))

The best split of t is defined by the pair (j, a) ∈ {1, ..., p} × R maximizing

∆(t, j, a) = R(t) − R(tL) − R(tR)
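
CUBT itself is distributed as an R package (see the conclusion of this section), but the deviance and the best-split search of Step 1 can be sketched in Python; taking αt = nt/m and searching exhaustively over observed thresholds are my assumptions:

# Minimal sketch of Step 1: deviance R(t) and the best split (j, a) of a node.
import numpy as np

def deviance(Xt, m):
    if len(Xt) < 2:
        return 0.0
    return (len(Xt) / m) * np.trace(np.cov(Xt, rowvar=False))   # R(t) = alpha_t tr(cov(Xt))

def best_split(Xt, m):
    best = (None, None, -np.inf)               # (feature j, threshold a, delta)
    for j in range(Xt.shape[1]):
        for a in np.unique(Xt[:, j])[:-1]:     # keep both child nodes non-empty
            left, right = Xt[Xt[:, j] <= a], Xt[Xt[:, j] > a]
            delta = deviance(Xt, m) - deviance(left, m) - deviance(right, m)
            if delta > best[2]:
                best = (j, a, delta)
    return best

X = np.vstack([np.random.normal(0, 1, (50, 2)), np.random.normal(5, 1, (50, 2))])
print(best_split(X, len(X)))                   # the split should separate the two groups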

Step 1: Growing the maximal tree

We denote by S the initial training dataset; each node t is split recursively until one of the following stopping criteria is met:
• All observations in t are the same
• There are fewer than minsize observations in t
• ∆(t, j, a) < mindev × R(S)
The clustering tree represents the partition. Each leaf represents a cluster.

Step 2: Pruning the tree


We denote by tL and tR the leaves obtained by splitting t.
Pruning criterion
If dδ(L, R) ≤ mindist, then tL and tR are aggregated.

dδ(L, R) is an empirical dissimilarity measure between tL and tR:

dδ(L, R) = max(d̄δL, d̄δR)

where, ∀δ ∈ [0, 1],

d̄δL = (1/(δnL)) Σ_{i=1}^{δnL} di   and   d̄δR = (1/(δnR)) Σ_{j=1}^{δnR} dj

Dissimilarity measure: illustration


Step 3: Joining the leaves

We aggregate leaves that do not come from the same parent.

Two joining criteria
Leaves are compared using:
1. ∆(tL, tR) = R(tL ∪ tR) − R(tL) − R(tR)
2. ∆(tL, tR) = dδ(L, R)

Let NL be the total number of leaves and K the expected number of classes.
∀(L, R) ∈ {1, ..., NL}, L ≠ R, we have (L̃, R̃) = argmin_{L,R} ∆(tL, tR)

Joining the leaves


tL̃ and tR̃ are replaced by their union tL̃ ∪ tR̃ and NL = NL − 1. Stop
when NL = K.

CUBT: pros and cons

Pros:
• Decisional method
• Interpretable clustering
• Extensions to other types of data (ordinal, nominal)
• Adapted to parallel computing
• Partition of the feature space, not only the training dataset
Cons:
• Same as CART
• Trees are unstable

2.3 Variable importance in CUBT

Motivation and objectives

Motivation
• Feature selection
• Dimension reduction
• Missing data
Objectives
• Define variable importance in CUBT
• Analyze its stability
• Compare to other methods

Competitive splits
To compute the importance of a feature j, we define the competitive split
of a feature j0 in a node t.

Competitive splits

The probability that an observation is sent to the left node by both splits is

p(tL ∩ t′L) = #{tL ∩ t′L} / nt

Given that an observation is in t, the probability that both splits sent it to the left is

pLL(s, sj) = p(tL ∩ t′L) / p(t)

pRR can be defined equivalently.

Surrogate splits and variable importance

We define an association measure between sj and s:

p(s, sj) = pLL + pRR

s̃j is a surrogate split of s if

p(s, s̃j) = max_{sj} p(s, sj)

The importance of variable j is given by

Imp(Xj) = Σ_t ∆(R(s̃j, t))

which is the loss of deviance induced if each node is replaced by the surrogate split defined on Xj.
Pierre Michel Prediction methods and Machine learning 91/102


2. Recent approaches in clustering
2.3. Variable importance in CUBT

Conclusion

• CUBT is an interpretable clustering method
• A measure of variable importance is defined in CUBT
• Heuristics have been proposed for tuning the method
• Stability of variable importance
How to use CUBT? (R users only... a Python version is in development)
What about clustering time series?
Let's look at a recent work on hierarchical clustering of time series, in the field of epidemiology, with my colleague Sokhna Dieng (PhD Student in Statistics, EHESP/SESSTIM).

2.4 Topic modelling using Latent Dirichlet Allocation (LDA)

Context: Natural Language Processing (NLP)

• Aim: find topics in documents (useful for search, browsing, information retrieval, NLP).
• Problem: no supplementary information (target variable y) about the documents is available, just the text: an unsupervised task.
• Fuzzy clustering problem: a document (observation) can be assigned to several topics (clusters): for example, a scientific article related to both finance and machine learning...
• (Best) method: LDA

Some assumptions of LDA

• a document can be related to multiple topics = an observation can be assigned to more than one cluster.
• LDA is a type of probabilistic model called a generative process (a document is generated using this process).
• A topic is a distribution generated over a mixture of words (a topic is generated before the documents in this process).
• The main tuning parameter you have to choose is K, the number of topics (as in K-means!).
We will see that other tuning parameters appear in the applications in Python (see the notebook attached to this session!).

LDA: a generative process

How does LDA generate a document?

1. Randomly choose a topic distribution (a vector of K probabilities)
2. For each word:
    ▸ randomly choose a topic from the topic distribution
    ▸ randomly choose a word from this topic (which is itself a distribution over words/tokens)

Note: words are independent from one another (unigram bag-of-words model): that is why LDA is not the best method.

LDA: notations and probabilistic approach

• β1:K correspond to topics, where ∀k, βk is a vector of probabilities (one probability for each word/token)
• θd are the topic proportions of document d (a vector of probabilities)
• θd,k is the proportion of topic k in document d
• zd are the topic assignments for document d
• zd,n is the topic assignment for word n in document d
• wd are the observed words in document d
The generative process corresponds to the following joint probability:

p(β1:K, θ1:D, z1:D, w1:D) = Π_{i=1}^{K} p(βi) Π_{d=1}^{D} ( p(θd) Π_{n=1}^{N} p(zd,n | θd) p(wd,n | β1:K, zd,n) )

LDA: illustration (traditional path diagram)

[Figure: plate diagram of LDA: α → θd → zd,n → wd,n ← βk ← δ, with plates over the N words, the D documents and the K topics]
Figure 4: Traditional path diagram illustrating LDA (inspired by Blei et al.). Arrows represent the conditional probabilities used in the generative process. Rectangles represent the replications of the process. The blue node corresponds to the observed variables (words).

Remarks about parameter estimation

• The variables that are interesting for interpreting the results are:
    ▸ βk, the vector of word probabilities for topic k
    ▸ θdk, the proportion of topic k in document d
• The generative process uses two usual probability distributions (check the corresponding functions in numpy.random):
    ▸ the multinomial distribution
    ▸ the Dirichlet distribution
• Parameter estimation is based on Gibbs sampling
At each iteration of the algorithm, we get updated values of βk and θdk.
The number of iterations (passes) is chosen by the user.

Probability estimates using Gibbs sampling

Some notations:
• zi is the topic assigned to token i in the corpus
• di is the document containing token i
• wi is the observed token (word)
• z−i are the topics assigned to the other tokens
Then we have:
P(zi = j | z−i, wi, di, α, δ) ∝ [ (C^WT_{wi,j} + δ) / (Σ_{w=1}^{W} C^WT_{w,j} + W δ) ] × [ (C^DT_{di,j} + α) / (Σ_{t=1}^{T} C^DT_{di,t} + T α) ]

where C^WT and C^DT are matrices of counts (for word-topic pairs and document-topic pairs).

Probability estimates using Gibbs sampling

The parameters of interest will be estimated as follows:

βik = (C^WT_{i,k} + δ) / (Σ_{w=1}^{W} C^WT_{w,k} + W δ)

θdj = (C^DT_{d,j} + α) / (Σ_{t=1}^{T} C^DT_{d,t} + T α)

And now... let's try LDA in Python!
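
A minimal sketch with scikit-learn (note that scikit-learn's implementation uses variational inference rather than Gibbs sampling; the toy corpus and the parameter values are illustrative, and the notebook attached to this session may rely on another library such as gensim):

# Minimal sketch: LDA topic modelling on a toy corpus with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stock market finance bank money",
        "neural network training gradient descent",
        "bank credit interest rate money",
        "machine learning model training data"]

# bag-of-words representation (unigram counts)
vec = CountVectorizer()
X = vec.fit_transform(docs)

# K = 2 topics, a fixed number of passes over the corpus
lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
theta = lda.fit_transform(X)                  # document-topic proportions (theta_d)
print(theta.round(2))

# top words per topic (from the topic-word matrix, i.e. the beta_k)
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]
    print(f"topic {k}:", [words[i] for i in top])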
