
Prediction methods and Machine learning

Session VI

Pierre Michel
pierre.michel@univ-amu.fr

M2 EBDS

2021
Pretty parametrization and visualization of a Neural Network and applications in Econometrics

TensorFlow Playground
Let’s also check the work of Loann Desboulets (PhD Student in Econometrics, AMSE).

1. Complements on Unsupervised Learning

1.1 Supervised versus Unsupervised Learning

Supervised versus Unsupervised Learning

What are the differences?


• Supervised and Unsupervised Learning share many features in the algorithms they use to estimate a model.
• Supervised Learning: a target variable Y is available (regression: Y ∈ R, classification: Y ∈ {1, ..., K}).
• Unsupervised Learning (no target variable Y) concerns Density Estimation and Clustering.
Note: in many cases, density estimation solves the clustering problem.

Supervised learning
Training set: {(x(1), y(1)), (x(2), y(2)), (x(3), y(3)), ..., (x(m), y(m))}
[Figure: scatter plot of the labeled training set in the (x1, x2) plane]
Unsupervised learning
Training set: {x(1), x(2), x(3), ..., x(m)}
[Figure: the same scatter plot without labels, in the (x1, x2) plane]
1.2 Density estimation

Density estimators

Density estimation is the unsupervised version of regression.


The goal is to estimate the density of a random variable or random vector x, given n i.i.d. observations of this random variable.
• Parametric estimators: often consider a mixture of densities, assume a specific form for the density, and estimate the parameters using Maximum Likelihood or Expectation-Maximization.

f(x) = Σ_{k=1}^{K} αk fk(x)
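
A minimal sketch of such a parametric estimator in Python, fitted by EM with scikit-learn's GaussianMixture (the simulated data and the choice K = 2 are illustrative assumptions, not taken from the slides):

# Minimal sketch: parametric density estimation with a 2-component Gaussian mixture (EM).
import numpy as np
from sklearn.mixture import GaussianMixture

x = np.concatenate([np.random.normal(4, 1, 1000),
                    np.random.normal(6, 1, 1000)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2).fit(x)
print(gm.weights_)                       # estimated mixture weights alpha_k
print(gm.means_)                         # estimated component means
grid = np.linspace(0, 10, 100).reshape(-1, 1)
f_hat = np.exp(gm.score_samples(grid))   # estimated density f(x) on a grid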

Parametric estimator: univariate example (2 densities)

[Figure: two histograms of a simulated variable. Left: a single density; right: a mixture of two densities (x-axis: Variable, y-axis: Frequency)]

This problem can also be interpreted as a 2-cluster clustering task (see further).

Univariate example in Python: dealing with histograms


import numpy as np
from matplotlib import pyplot as plt

# common bins for all histograms
bins = np.linspace(0, 8, 100)

# two simulated Gaussian samples
x1 = np.random.normal(4, 1, 1000)
x2 = np.random.normal(6, 1, 1000)

# overlaid histograms of the two samples
plt.hist(x1, bins, alpha=0.5, label='x1')
plt.hist(x2, bins, alpha=0.5, label='x2')
plt.legend(loc='upper right')
plt.show()

# side by side: the pooled sample (mixture) and the two components
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.hist(np.append(x1, x2), bins, alpha=0.5, label='x')
ax2.hist(x1, bins, alpha=0.5, label='x1')
ax2.hist(x2, bins, alpha=0.5, label='x2')
plt.show()

Univariate example in Python: dealing with histograms


Density estimators
• Non-parametric estimators: using histograms or more generally
Kernel Density Estimators, select a kernel function κ and
consider the following estimator:

f(x) = (1 / (mh)) Σ_{i=1}^{m} κ( (x − x(i)) / h )

where h is the bandwidth parameter, which controls the width of the histogram breaks (or of the kernels).
You should vary the value of the breaks argument (or the bandwidth) to compare different estimators; see the Python sketch after this list.
• Ensemble methods: averaging, boosting, stacking (among others).
• Recent approaches: see the work of Mathias Bourel about boosting
approaches for recent examples. . .
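
A minimal sketch of a kernel density estimator in Python, assuming a Gaussian kernel and the simulated sample from the histogram example (the bandwidth values are illustrative):

# Minimal sketch: Gaussian kernel density estimates for two bandwidth settings.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

x = np.concatenate([np.random.normal(4, 1, 1000),
                    np.random.normal(6, 1, 1000)])
grid = np.linspace(0, 10, 200)

for h in [0.1, 0.5]:                     # bandwidth factors to compare
    kde = gaussian_kde(x, bw_method=h)   # kappa = Gaussian kernel
    plt.plot(grid, kde(grid), label=f"h = {h}")
plt.legend()
plt.show()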
1.3 Clustering

What is clustering?

Clustering aims to partition the data. The data represent a set of unlabeled observations.
We have seen that some supervised methods are also based on partitioning the feature space (CART and its extensions, SVM, linear discriminant analysis...).
Unsupervised partitioning (clustering) may be:
• hierarchical: AHC (agglomerative hierarchical clustering), CUBT. . .
• partitional non-hierarchical: K-means. . .
• density-based: mixture models, DBSCAN. . .
• soft/fuzzy: c-means, Latent Dirichlet Allocation (LDA, specific to
topic modelling). . .

Different applications of clustering

• Market segmentation
• Clinical medicine
• Social network analysis
• Cluster computing
• Astronomical data analysis
• Genetic data analysis
• ...

Clustering: which type of data can you cluster?

Any type of data can be clustered.


The data can be of different types:
• continuous (quantitative)
• ordinal (quantitative)
• nominal (qualitative)
• longitudinal (time-series)
• ...
Most widely used method: K-means (MacQueen’s algorithm), based on
a distance matrix adapted to the data you use.

Dissimilarity measures

Consider a set of m observations (x(1) , x(2) , ..., x(m) ), ∀i, x(i) ∈ Rn


A dissimilarity measure verifies the following properties, ∀i, j, k ∈
{1, ..., m}:
1. d(x(i) , x(i) ) = 0
2. d(x(i) , x(j) ) = d(x(j) , x(i) )
3. d(x(i) , x(j) ) = 0 =⇒ x(i) = x(j) (dissimilarity)
4. d(x(i) , x(j) ) ≤ d(x(i) , x(k) ) + d(x(k) , x(j) ) (distance)
5. d(x(i) , x(j) ) ≤ max(d(x(i) , x(k) ), d(x(k) , x(j) )) (ultrametric distance)

Choosing a dissimilarity measure

• Depends on the variables' type. Generally, we use the Euclidean distance, defined as follows:

d(x(i), x(j)) = sqrt( (x(i) − x(j))^T (x(i) − x(j)) )

• For quantitative variables, using the norm:

d(x(i), x(j)) = ||x(i) − x(j)||_2

Homogeneity criterion (inertia)


Consider m observations grouped in K clusters {C1, ..., CK}.
∀k ∈ {1, ..., K}, gk is the barycenter of Ck, mk is the number of observations in Ck and pk is the cluster weight. We define:
• Within-cluster inertia

Iw = Σ_{k=1}^{K} Ik = Σ_{k=1}^{K} Σ_{i=1}^{m} pk ||x(i) − gk||² 1{x(i) ∈ Ck}

• Between-cluster inertia

Ib = Σ_{k=1}^{K} pk ||gk − g||²

Note: g is the barycenter of the m observations.
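
A minimal NumPy sketch of these two quantities, assuming the usual uniform observation weight 1/m (so that the cluster weight is pk = mk/m); the data and the partition are illustrative:

# Minimal sketch: within- and between-cluster inertia of a given partition.
import numpy as np

X = np.random.rand(100, 2)               # m = 100 observations in R^2
labels = np.random.randint(0, 3, 100)    # an arbitrary partition into K = 3 clusters
m = len(X)
g = X.mean(axis=0)                       # global barycenter g

I_w, I_b = 0.0, 0.0
for k in np.unique(labels):
    Xk = X[labels == k]
    gk = Xk.mean(axis=0)                 # barycenter g_k of cluster k
    I_w += ((Xk - gk) ** 2).sum() / m    # weight 1/m per observation
    I_b += (len(Xk) / m) * ((gk - g) ** 2).sum()   # cluster weight m_k/m

I = ((X - g) ** 2).sum() / m             # total inertia
print(np.isclose(I, I_w + I_b))          # Huygens' theorem: I = Iw + Ib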


Huygens’ theorem

Total inertia I is the sum of within-cluster inertia and between-cluster inertia. Huygens' theorem thus states:

I = Iw + Ib

The number of clusters K̃ is traditionally chosen using the following homogeneity criterion:

K̃ = min_{K>0} Iw

Homogeneity criterion: interpretation

• a cluster is all the more homogeneous as its (within-cluster) inertia is low
• good clustering =⇒ 2 criteria: low Iw, high Ib
• these two criteria are equivalent according to Huygens' theorem

I = Iw + Ib

Scatterplot
[Figure: scatter plot of the example observations]

Barycenter (gravity center)


[Figure: the same scatter plot with the global barycenter g]

Total inertia
[Figure: illustration of total inertia on the scatter plot]

Total inertia = Within-cluster inertia + Between-cluster inertia

[Figure: illustration of the decomposition of total inertia on the scatter plot]



K-means algorithm

Input:
• K (number of clusters)
• Training set: x(1), x(2), x(3), ..., x(m)

One observation is denoted x(i) ∈ R^n

K-means algorithm
Randomly initialize K cluster centroids µ1, µ2, ..., µK ∈ R^n
Repeat
    for i = 1 to m (cluster assignment step)
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)

        c(i) = argmin_k ||x(i) − µk||²

    for k = 1 to K (update centroid step)
        µk := average (mean) of the points assigned to cluster k

        µk = (1 / #{i : c(i) = k}) Σ_{i : c(i) = k} x(i)

Cluster separability

[Figure: two scatter plots illustrating linear and non-linear cluster separability; the t-shirt sizing example plots Weight against Height]



Cluster separability

[Figure: the same two scatter plots with the resulting clusters; the t-shirt sizing panel shows three groups labeled S, M and L]



K-means algorithm: an optimization problem


• c(i) : index of cluster (1,2,. . . ,K) to which observation x(i) is
currently assigned
• µk : cluster centroid k (µk ∈ Rn )
• µc(i) : cluster centroid of cluster to which observation x(i) has been
assigned
Cost function:

J(c(1), ..., c(m), µ1, ..., µK) = (1/m) Σ_{i=1}^{m} ||x(i) − µ_{c(i)}||²
Minimization problem:

min_{c(1),...,c(m); µ1,...,µK} (1/m) Σ_{i=1}^{m} ||x(i) − µ_{c(i)}||²

K-means: random initialization and local optima

• We look for K < m
• Initialization: randomly pick K observations
• Set µ1, ..., µK equal to these K observations.
Issue: K-means can converge to local optima, so the returned partition may not be the same across different runs of the algorithm.
Solution: Repeated K-means

Repeated K-means

For i = 1 to 100:
    Randomly initialize K-means
    Run the K-means algorithm; get c(1), ..., c(m), µ1, ..., µK
    Compute the cost function J(c(1), ..., c(m), µ1, ..., µK)
Finally, pick the clustering that gave the lowest cost J(c(1), ..., c(m), µ1, ..., µK)
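
In practice this is what scikit-learn's KMeans does through its n_init parameter: the algorithm is rerun n_init times with different random initializations and the solution with the lowest cost (inertia) is kept. A minimal sketch (the data and parameter values are illustrative):

# Minimal sketch: repeated K-means via the n_init parameter.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)
km = KMeans(n_clusters=3, n_init=100, random_state=0).fit(X)
print(km.inertia_)   # lowest cost over the 100 runs (J up to the 1/m factor)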

K-means: the algorithm (MacQueen's)

• Choose random centers (K training examples randomly chosen).
• Repeat until the centers converge:
    ▸ Assign each observation x(i) to its closest center (in distance).
    ▸ Compute the new cluster centers (with the observations assigned at the previous step).
• Repeat the previous steps 10 times and keep the partition with the minimum within-cluster inertia (or sum of squares, or variance):

Iw = Σ_{i=1}^{m} Σ_{k=1}^{K} ||x(i) − µk||² 1{c(i) = k}

K-means: illustration

Example with the iris dataset


[Figure: scatter plot of the iris dataset, Sepal.Length vs. Sepal.Width]

K-means: illustration

[Figure: K-means partition (K = 3) of the iris dataset, Sepal.Length vs. Sepal.Width]

K-means: how to choose K?

[Figure: within-cluster inertia as a function of K, for K = 1 to 20 (elbow curve)]
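
A minimal sketch of this elbow curve with scikit-learn (the iris dataset is used only as an illustration):

# Minimal sketch: within-cluster inertia as a function of K (elbow curve).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
Ks = range(1, 21)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in Ks]

plt.plot(Ks, inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Within-cluster inertia')
plt.show()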

Scatterplot
[Figure: scatter plot of the example observations]

Forgy’s algorithm: Initialization of clusters (iteration 1)


[Figure: initial cluster centers chosen at random among the observations]

Forgy’s algorithm: Assign observations (iteration 1)


[Figure: each observation assigned to its closest center]

Forgy’s algorithm: Update centers (iteration 1)


[Figure: centers updated to the mean of their assigned observations]

Forgy’s algorithm: Assign observations (iteration 2)


[Figure: observations reassigned to the updated centers]

Forgy’s algorithm: Update centers (iteration 2)


[Figure: centers updated again]

Forgy’s algorithm: Assign observations (iteration 3)


[Figure: observations reassigned at iteration 3]

Extensions of K-means

• K-modes: better suited to qualitative data. The simple matching distance is often used in this case (variables with more than 2 levels should be converted to their binary representation). It uses modes rather than means as cluster centers.
• K-medians: recommended when dealing with ordinal data. It uses medians rather than means as cluster centers. A preferred distance here is the Manhattan distance.
• c-means: soft/fuzzy version of K-means in which an observation can be assigned to more than one cluster.

Remarks about K-means

• Pros
    ▸ The algorithm reduces within-cluster inertia at each step: it converges
    ▸ Few iterations needed
• Cons
    ▸ Unstable: the partition obtained depends on the initialization: run K-means several times...
    ▸ The number of clusters K is fixed by the user: simulations, principal component analysis...

K-means in Python

# import sklearn and numpy
from sklearn.cluster import KMeans
import numpy as np
# create some data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
# run K-means and fit to the data
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
# print cluster labels
print(kmeans.labels_)
# predict clusters for new data
kmeans.predict([[0, 0], [12, 3]])
# print centers coordinates
print(kmeans.cluster_centers_)

Hierarchical clustering

Hierarchical methods construct a dendrogram (a binary tree).

They are also based on a distance matrix (adapted to your data).
They can be agglomerative (“bottom-up”) or divisive (“top-down”).
We need to define a between-cluster distance (or agglomerative linkage strategy):
• single linkage
• complete linkage
• Ward’s method

Agglomerative hierarchical clustering

Goal: construct a set of partitions by successive cluster groupings

Output: not one partition, but a hierarchy of partitions, from m clusters to 1 cluster, reducing between-cluster inertia at each grouping

Agglomerative hierarchical clustering: algorithm

• Initialization: m clusters, corresponding to the m observations
• Repeat until only one cluster remains:
    ▸ Compute the distances between each pair of clusters
    ▸ Group the two nearest clusters into one
    ▸ Update the distances between each pair of clusters

Note: we need to define a dissimilarity measure between clusters. How to choose K?

Agglomerative hierarchical clustering: illustration

[Figure: step-by-step illustration on a small set of points in the (x, y) plane; the two closest observations/groups are merged at each step, forming groups 1 to 8 and building up the hierarchy]



Measuring dissimilarities between 2 clusters A and B

• Single linkage

∆(A, B) = min_{x(i) ∈ A, x(j) ∈ B} d(x(i), x(j))

• Complete linkage

∆(A, B) = max_{x(i) ∈ A, x(j) ∈ B} d(x(i), x(j))

• Ward's method

∆(A, B) = (pA pB / (pA + pB)) d²(gA, gB)

Measuring dissimilarities between 2 clusters A and B: illustration

[Figure: three panels illustrating single linkage, complete linkage and Ward's method between two clusters of points]


Dendrogram: example with the iris dataset

[Figure: cluster dendrogram of the 150 iris observations (y-axis: Height, 0 to 1000)]

Agglomerative hierarchical clustering in Python


from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
# create some data (8 observations, 1 feature)
X = [[i] for i in [2, 8, 0, 4, 1, 9, 9, 0]]
# AHC with Ward's method
Z = linkage(X, 'ward')
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)
# AHC with single linkage
Z = linkage(X, 'single')
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)
# plot
plt.show()

Note: X can be a distance matrix.



What about distance matrices... in Python?


import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

# two small 2-D point clouds
x1 = np.random.normal(4, 1, 4)
x2 = np.random.normal(6, 1, 4)
y1 = np.random.normal(4, 1, 4)
y2 = np.random.normal(6, 1, 4)

plt.scatter(x1, y1)
plt.scatter(x2, y2)
plt.show()

# stack the 8 points into an 8x2 array and compute pairwise Euclidean distances
x = np.append(x1, x2)
y = np.append(y1, y2)
dat = np.c_[x, y]
dist = pdist(dat, metric="euclidean")   # condensed distance matrix
print(squareform(dist).shape)           # square form: (8, 8)

1.4 Data simulation and performance evaluation

Ordinal and nominal data simulation models

By definition, in unsupervised learning, we do not have access to a vector of target values y. To overcome this issue and be able to compute a "proxy" error rate, we can use simulated data.
In the following, we propose some simulation models for qualitative data.
We define some parameters:
• K the number of clusters
• n the number of variables
• m ∈ {100, 300, 500, 1000} the number of observations
• lj = l the number of levels
• m/K the number of observations per group

Model 1: a simple model

We set K = 3, n = 9, l = 5. Each cluster is characterized by a high frequency of one level.
In this example, levels 1, 3 and 5 are the most frequent for clusters 1, 2 and 3 respectively.
The other levels are uniformly distributed. For example, the distribution of each variable xj in cluster 1 is defined as follows:

P(xj = 1) = q
P(xj = x) = (1 − q) / (l − 1)   ∀x ≠ 1

The same is done for the other clusters. A good choice for q would be 0.8 (high frequency).
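
A minimal NumPy sketch of Model 1 (the function and variable names are mine, not from the slides):

# Minimal sketch of Model 1: K = 3 clusters, n = 9 nominal variables, l = 5 levels;
# in each cluster one level has probability q, the others share the rest uniformly.
import numpy as np

def simulate_model1(m=300, K=3, n=9, l=5, q=0.8, seed=0):
    rng = np.random.default_rng(seed)
    dominant = [1, 3, 5]                      # dominant level of each cluster
    X, y = [], []
    for c in range(K):
        probs = np.full(l, (1 - q) / (l - 1))
        probs[dominant[c] - 1] = q            # levels are coded 1..l
        Xc = rng.choice(np.arange(1, l + 1), size=(m // K, n), p=probs)
        X.append(Xc)
        y += [c] * (m // K)
    return np.vstack(X), np.array(y)

X, y = simulate_model1()
print(X.shape, np.bincount(y))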

Model 2: IRT-based simulation


We set K = 3. We propose an approach based on Item Response Theory (IRT).
We herein use the Generalized Partial Credit Model (GPCM); the probability that an individual i gives the response x to item j of a questionnaire is:

pjx(θ) = P(xj(i) = x | θ) = exp( Σ_{k=1}^{x} αj(θi − βjk) ) / Σ_{r=1}^{l} exp( Σ_{k=0}^{r} αj(θi − βjk) )

• θ is the individual parameter (also called latent trait or ability)
• βjk is the threshold parameter for the k-th level of item j
• αj is the discrimination parameter of item j
For each class c ∈ {1, ..., K} we simulate a vector of m values (θi)_{i=1,...,m} distributed as N(µc, σ²) with µ1 = −3, µ2 = 0, µ3 = 3 and σ² = 0.1.
For each item j, we set αj = 1.2 and βj = (−1, −1/3, 1/3, 1).
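
A minimal NumPy sketch of Model 2, under the common convention that response categories are coded 0, ..., l−1 and that the k = 0 term of the cumulative sum is zero (this convention, the number of items n = 9, and all names are my assumptions):

# Minimal sketch of Model 2: GPCM responses for K = 3 latent classes.
import numpy as np

def gpcm_probs(theta, alpha, beta):
    # cumulative scores s_x = sum_{k<=x} alpha * (theta - beta_k), with s_0 = 0
    scores = np.concatenate(([0.0], np.cumsum(alpha * (theta - beta))))
    expo = np.exp(scores - scores.max())      # softmax, numerically stable
    return expo / expo.sum()

def simulate_model2(m=300, n=9, K=3, alpha=1.2, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.array([-1, -1/3, 1/3, 1])       # 4 thresholds -> 5 levels
    mus = [-3, 0, 3]                          # one latent mean per class
    X, y = [], []
    for c in range(K):
        thetas = rng.normal(mus[c], np.sqrt(0.1), m // K)
        for th in thetas:
            X.append([rng.choice(len(beta) + 1, p=gpcm_probs(th, alpha, beta))
                      for _ in range(n)])
            y.append(c)
    return np.array(X), np.array(y)

X, y = simulate_model2()
print(X.shape)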


Model 3: Tree-based model

K = 4, p = 3, l = 6. Each level is encoded as an integer, and we distinguish odd and even levels. Clusters are defined as follows:
• C1 : x1 and x2 have odd levels, x3 is random
• C2 : x1 has odd levels, x2 has even levels, x3 is random
• C3 : x1 has even levels, x3 has odd levels, x2 is random
• C4 : x1 and x3 have even levels, x2 is random

Figure 2: Tree structure used for model 3

Model 4: Another tree-based model

K = 4, p = 3, l = 4.
The only difference with previous model is that the levels are not uniformly
distributed in each cluster.
Let’s consider a parameter p0 controlling the non-uniformity of the level distribution; for example, set p0 = 0.8 and define the clusters as follows:
• C1 : x1 and x2 have odd levels with P (x1 = 1) = P (x2 = 1) = p0 ,
x3 is random
• C2 : x1 has odd levels, x2 has even levels with
P (x1 = 1) = P (x2 = 2) = p0 , x3 is random
• C3 : x1 has even levels, x3 has odd levels, with
P (x1 = 2) = P (x3 = 1) = p0 , x2 is random
• C4 : x1 and x3 have even levels with P (x1 = 2) = P (x3 = 2) = p0 ,
x2 is random

Performance criterion: Category utility

Consider a partition C = {Ck}_{k=1,...,K}, found by a clustering algorithm based on given features (variables) fj, j = 1, ..., n. The features are assumed to be nominal, so that each value that fj can take has the form vjl.
The category utility function scores a partition C given a set of features, according to the formula:

 
CU(C) = (1/K) Σ_{k=1}^{K} P(Ck) [ Σ_j Σ_l P(fj = vjl | Ck)² − Σ_j Σ_l P(fj = vjl)² ]

This criterion (not based on inertia) is useful to check the quality of a partition obtained with a clustering method on qualitative data.
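
A minimal sketch of this criterion for a matrix of nominal data (the function name, the data and the partition are illustrative):

# Minimal sketch: category utility of a partition of nominal data.
import numpy as np

def category_utility(X, labels):
    m, n = X.shape
    clusters, counts = np.unique(labels, return_counts=True)
    cu = 0.0
    for k, mk in zip(clusters, counts):
        Xk = X[labels == k]
        inner = 0.0
        for j in range(n):
            # sum over levels of P(fj = v | Ck)^2 - P(fj = v)^2
            for v in np.unique(X[:, j]):
                inner += np.mean(Xk[:, j] == v) ** 2 - np.mean(X[:, j] == v) ** 2
        cu += (mk / m) * inner
    return cu / len(clusters)

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(300, 9))     # illustrative nominal data, levels 1..5
labels = rng.integers(0, 3, size=300)     # an arbitrary partition into 3 clusters
print(category_utility(X, labels))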

Another performance criterion: Misclassification Error

The Misclassification Error (ME) rate can be used as follows:


Let y1 , y2 , ..., ym be the class labels of each observation (in practice, you
do not have access to y, so data simulation is required).
Let ŷ1, ŷ2, ..., ŷm be the "predicted" labels assigned by a clustering algorithm.
We denote Σ the set of all possible permutations of the predicted labels.
The ME rate, also called “matching error” is defined as follows:

ME = min_{σ∈Σ} (1/m) Σ_{i=1}^{m} 1{yi ≠ σ(ŷi)}

This empirically solves the label-switching problem, typical of clustering tasks.
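
A minimal sketch of this criterion; for a small number of clusters, the minimum over permutations can be computed by brute force (for larger K, an assignment solver such as scipy.optimize.linear_sum_assignment is the usual choice):

# Minimal sketch: misclassification (matching) error between true labels y
# and clustering labels y_hat, minimized over permutations of the labels.
from itertools import permutations
import numpy as np

def matching_error(y, y_hat):
    labels = np.unique(y_hat)
    best = 1.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        err = np.mean(y != np.array([mapping[c] for c in y_hat]))
        best = min(best, err)
    return best

y     = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])   # same clusters, switched labels
print(matching_error(y, y_hat))                 # 1/9 after the best relabelling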

2. Recent approaches in clustering

2.1 Density-based spatial clustering of applications with noise (DBSCAN)

DBSCAN

DBSCAN is a density-based, non-hierarchical clustering method that uses only two parameters:
• ε, which is a reachability distance (a radius)
• MinPts, the minimum number of points (or training examples) required to form a cluster
These two parameters tune the method and have to be fixed by the user.
The goal is to separate high-density regions (determined by core points) from low-density regions (determined by noise points) in the feature space.

DBSCAN: illustration

Figure 3: Illustration of DBSCAN (Wikipedia). Red points represent a high-density region, the blue point represents a low-density region, and yellow points represent the "frontiers" of their cluster.

DBSCAN: the algorithm

• Choose a random point (observation) x0(i) from the data.
• Construct the ε-neighborhood of this point (the set of points that are at distance less than ε from x0(i)).
    ▸ if there are at least MinPts points in the ε-neighborhood, then it will form a cluster (high-density region)
    ▸ otherwise, the points in the ε-neighborhood will be considered as noise (low-density region)

All dense points found in an ε-neighborhood are added to the cluster.
Once no dense point is found any more (we then talk about border points), another random point is chosen and the process is repeated to explore new clusters.

DBSCAN in Python

from sklearn.cluster import DBSCAN
import numpy as np
# create some data 6x2
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
# fit DBSCAN
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
# cluster labels (label -1 means noise)
print(clustering.labels_)
# model info
print(clustering)

2.2 Clustering using binary trees

Clustering using unsupervised binary trees

CUBT is a top-down hierarchical clustering method that works in 3 steps:
• growing the maximal tree: recursive binary partitioning
• pruning the tree (dissimilarity-based pruning)
• joining the leaves of the tree (alternative pruning)

Similarities with CART

CUBT has many similarities with CART:


• Efficiency
• Flexibility
• Interpretability
• Good convergence properties

Step 1: Growing the maximal tree

Let t be a tree node containing a set of observations in R^p. The child nodes of t are denoted tL and tR, defined as follows:

tL = {x ∈ R^p | xj ≤ a}   and   tR = {x ∈ R^p | xj > a}

Let Xt = {x | x ∈ t}, αt = P(x ∈ t) and R(t) a heterogeneity measure (deviance) of t, defined as:

R(t) = αt tr(cov(Xt))

The best split of t is defined by the pair (j, a) ∈ {1, ..., p} × R maximizing

∆(t, j, a) = R(t) − R(tL) − R(tR)
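
CUBT itself is distributed as an R package (see the conclusion of this section), but the deviance and the best-split search of Step 1 can be sketched in Python; taking αt = nt/m and searching exhaustively over observed thresholds are my assumptions:

# Minimal sketch of Step 1: deviance R(t) and the best split (j, a) of a node.
import numpy as np

def deviance(Xt, m):
    if len(Xt) < 2:
        return 0.0
    return (len(Xt) / m) * np.trace(np.cov(Xt, rowvar=False))   # R(t) = alpha_t tr(cov(Xt))

def best_split(Xt, m):
    best = (None, None, -np.inf)               # (feature j, threshold a, delta)
    for j in range(Xt.shape[1]):
        for a in np.unique(Xt[:, j])[:-1]:     # keep both child nodes non-empty
            left, right = Xt[Xt[:, j] <= a], Xt[Xt[:, j] > a]
            delta = deviance(Xt, m) - deviance(left, m) - deviance(right, m)
            if delta > best[2]:
                best = (j, a, delta)
    return best

X = np.vstack([np.random.normal(0, 1, (50, 2)), np.random.normal(5, 1, (50, 2))])
print(best_split(X, len(X)))                   # the split should separate the two groups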

Step 1: Growing the maximal tree

We denote by S the initial training dataset; each node t is split recursively until one of the following stopping criteria is met:
• All observations in t are the same
• There are fewer than minsize observations in t
• ∆(t, j, a) < mindev × R(S)
The clustering tree represents the partition. Each leaf represents a cluster.

Step 2: Pruning the tree


We denote by tL and tR the leaves obtained by splitting t.
Pruning criterion
If dδ(L, R) ≤ mindist, then tL and tR are aggregated.

dδ(L, R) is an empirical dissimilarity measure between tL and tR:

dδ(L, R) = max(d̄δL, d̄δR)

where, ∀δ ∈ [0, 1],

d̄δL = (1/(δnL)) Σ_{i=1}^{δnL} di   and   d̄δR = (1/(δnR)) Σ_{j=1}^{δnR} dj

Dissimilarity measure: illustration


Step 3: Joining the leaves

We aggregate leaves that do not come from the same parent.

Two joining criteria
Leaves are compared using:
1. ∆(tL, tR) = R(tL ∪ tR) − R(tL) − R(tR)
2. ∆(tL, tR) = dδ(L, R)

Let NL be the total number of leaves and K the expected number of classes.
∀(L, R) ∈ {1, ..., NL}, L ≠ R, we have (L̃, R̃) = argmin_{L,R} ∆(tL, tR)

Joining the leaves


tL̃ and tR̃ are replaced by their union tL̃ ∪ tR̃ and NL = NL − 1. Stop
when NL = K.

CUBT: pros and cons

Pros:
• Decisional method
• Interpretable clustering
• Extensions to other types of data (ordinal, nominal)
• Adapted to parallel computing
• Partition of the feature space, not only the training dataset
Cons:
• Same as CART
• Trees are unstable

2.3 Variable importance in CUBT

Motivation and objectives

Motivation
• Feature selection
• Dimension reduction
• Missing data
Objectives
• Define variable importance in CUBT
• Analyze its stability
• Compare to other methods

Competitive splits
To compute the importance of a feature j, we define the competitive split
of a feature j0 in a node t.

Competitive splits

The probability that an observation is sent to the left node by both splits is

p(tL ∩ t′L) = #{tL ∩ t′L} / nt

Given that an observation is in t, the probability that both splits sent it to the left is

pLL(s, sj) = p(tL ∩ t′L) / p(t)

pRR can be defined equivalently.

Surrogate splits and variable importance

We define an association measure between sj and s:

p(s, sj) = pLL + pRR

s̃j is a surrogate split of s if

p(s, s̃j) = max_{sj} p(s, sj)

The importance of variable j is given by

Imp(Xj) = Σ_t ∆(R(s̃j, t))

which is the loss of deviance induced if each node is replaced by the surrogate split defined on Xj.
Pierre Michel Prediction methods and Machine learning 91/102


2. Recent approaches in clustering
2.3. Variable importance in CUBT

Conclusion

• CUBT is an interpretable clustering method
• A measure of variable importance is defined in CUBT
• Heuristics have been proposed for tuning the method
• Stability of variable importance
How to use CUBT? (R users only... a Python version is in development)
What about clustering time series?
Let's look at a recent work on hierarchical clustering of time series, in the field of epidemiology, with my colleague Sokhna Dieng (PhD Student in Statistics, EHESP/SESSTIM).

2.4 Topic modelling using Latent Dirichlet Allocation (LDA)

Context: Natural Language Processing (NLP)

• Aim: find topics in documents (useful for search, browsing, information retrieval, NLP).
• Problem: no supplementary information (target variable y) about the documents is available, just the text: an unsupervised task.
• Fuzzy clustering problem: a document (observation) can be assigned to several topics (clusters): for example, a scientific article related to both finance and machine learning...
• (Best) method: LDA

Some assumptions of LDA

• a document can be related to multiple topics = an observation can be assigned to more than one cluster.
• LDA is a type of probabilistic model called a generative process (a document is generated using this process).
• A topic is a distribution generated over a mixture of words (a topic is generated before the documents in this process).
• The main tuning parameter you have to choose is K, the number of topics (as in K-means!).
We will see that other tuning parameters appear in the applications in Python (see the notebook attached to this session!).

LDA: a generative process

How does LDA generate a document?

1. Randomly choose a topic distribution (a vector of K probabilities)
2. For each word:
    ▸ randomly choose a topic from the topic distribution
    ▸ randomly choose a word from this topic (which is itself a distribution over words/tokens)

Note: words are independent from one another (unigram bag-of-words model): that is why LDA is not the best method.

LDA: notations and probabilistic approach

• β1:K correspond to topics, where ∀k, βk is a vector of probabilities (one probability for each word/token)
• θd are the topic proportions of document d (a vector of probabilities)
• θd,k is the proportion of topic k in document d
• zd are the topic assignments for document d
• zd,n is the topic assignment for word n in document d
• wd are the observed words in document d
The generative process corresponds to the following joint probability:

p(β1:K, θ1:D, z1:D, w1:D) = Π_{i=1}^{K} p(βi) Π_{d=1}^{D} ( p(θd) Π_{n=1}^{N} p(zd,n | θd) p(wd,n | β1:K, zd,n) )

LDA: illustration (traditional path diagram)

[Figure: plate diagram of LDA: α → θd → zd,n → wd,n ← βk ← δ, with plates over the N words, the D documents and the K topics]
Figure 4: Traditional path diagram illustrating LDA (inspired by Blei et al.). Arrows represent the conditional probabilities used in the generative process. Rectangles represent the replications of the process. The blue node corresponds to the observed variables (words).

Remarks about parameter estimation

• The variables that are interesting for interpreting the results are:
    ▸ βk, the vector of word probabilities for topic k
    ▸ θdk, the proportion of topic k in document d
• The generative process uses two usual probability distributions (check the corresponding functions in numpy.random):
    ▸ the multinomial distribution
    ▸ the Dirichlet distribution
• Parameter estimation is based on Gibbs sampling
At each iteration of the algorithm, we get updated values of βk and θdk.
The number of iterations (passes) is chosen by the user.

Probability estimates using Gibbs sampling

Some notations:
• zi is the topic assigned to token i in the corpus
• di is the document containing token i
• wi is the observed token (word)
• z−i are the topics assigned to the other tokens
Then we have:
P(zi = j | z−i, wi, di, α, δ) ∝ [ (C^WT_{wi,j} + δ) / (Σ_{w=1}^{W} C^WT_{w,j} + W δ) ] × [ (C^DT_{di,j} + α) / (Σ_{t=1}^{T} C^DT_{di,t} + T α) ]

where C^WT and C^DT are matrices of counts (for word-topic pairs and document-topic pairs).

Probability estimates using Gibbs sampling

The parameters of interest will be estimated as follows:

βik = (C^WT_{i,k} + δ) / (Σ_{w=1}^{W} C^WT_{w,k} + W δ)

θdj = (C^DT_{d,j} + α) / (Σ_{t=1}^{T} C^DT_{d,t} + T α)

And now... let's try LDA in Python!
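
A minimal sketch with scikit-learn (note that scikit-learn's implementation uses variational inference rather than Gibbs sampling; the toy corpus and the parameter values are illustrative, and the notebook attached to this session may rely on another library such as gensim):

# Minimal sketch: LDA topic modelling on a toy corpus with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stock market finance bank money",
        "neural network training gradient descent",
        "bank credit interest rate money",
        "machine learning model training data"]

# bag-of-words representation (unigram counts)
vec = CountVectorizer()
X = vec.fit_transform(docs)

# K = 2 topics, a fixed number of passes over the corpus
lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
theta = lda.fit_transform(X)                  # document-topic proportions (theta_d)
print(theta.round(2))

# top words per topic (from the topic-word matrix, i.e. the beta_k)
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]
    print(f"topic {k}:", [words[i] for i in top])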
