Session VI
Pierre Michel
pierre.michel@univ-amu.fr
M2 EBDS
2021
Pretty parametrization and visualization of a Neural Network and applications in Econometrics
TensorFlow Playground
Let's also check the work of Loann Desboulets (PhD student in Econometrics, AMSE).
Supervised learning
Training set: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), \ldots, (x^{(m)}, y^{(m)})$
[Figure: scatterplot of a labeled training set in the (x1, x2) plane]
1. Complements on Unsupervised Learning
1.1. Supervised versus Unsupervised Learning
Unsupervised learning
Training set: $x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}$
[Figure: scatterplot of the same unlabeled training set in the (x1, x2) plane]
1. Complements on Unsupervised Learning
1.2. Density estimation
Density estimators
• Parametric estimators: model the density as a mixture of $K$ components $f_k$ with weights $\alpha_k$:
$$f(x) = \sum_{k=1}^{K} \alpha_k f_k(x)$$
[Figure: two histograms (Frequency vs. Variable) of the samples x1 and x2 generated below]
import numpy as np

x1 = np.random.normal(4, 1, 1000)  # 1000 draws from N(4, 1)
x2 = np.random.normal(6, 1, 1000)  # 1000 draws from N(6, 1)
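As a hedged sketch (matplotlib assumed available, number of bins illustrative), the two histograms above can be reproduced from these samples:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2)
axes[0].hist(x1, bins=20)  # sample centered at 4
axes[1].hist(x2, bins=20)  # sample centered at 6
for ax in axes:
    ax.set_xlabel("Variable")
    ax.set_ylabel("Frequency")
plt.show()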
Density estimators
• Non-parametric estimators: using histograms or, more generally, Kernel Density Estimators: select a kernel function $\kappa$ and consider the following estimator:
$$f(x) = \frac{1}{mh} \sum_{i=1}^{m} \kappa\left(\frac{x - x^{(i)}}{h}\right)$$
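As an illustration (a sketch, not from the slides: scipy assumed available, Gaussian kernel, automatic bandwidth), a kernel density estimate of the two-component sample:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

x = np.concatenate([np.random.normal(4, 1, 1000),
                    np.random.normal(6, 1, 1000)])
kde = gaussian_kde(x)                    # Gaussian kernel, bandwidth by Scott's rule
grid = np.linspace(0, 10, 200)
plt.hist(x, bins=30, density=True, alpha=0.3)  # histogram estimator, for comparison
plt.plot(grid, kde(grid))                      # kernel density estimate f(x)
plt.show()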
1.3. Clustering
What is clustering?
• Market segmentation
• Clinical medicine
• Social network analysis
• Cluster computing
• Astronomical data analysis
• Genetic data analysis
• ...
Dissimilarity measures
• Within-cluster inertia
$$I_w = \sum_{k=1}^{K} I_k = \sum_{k=1}^{K} \sum_{i=1}^{m_k} p_k \, \|x^{(i)} - g_k\|^2 \, \mathbf{1}_{\{x^{(i)} \in C_k\}}$$
• Between-cluster inertia
$$I_b = \sum_{k=1}^{K} p_k \, \|g_k - g\|^2$$
Huygens' theorem:
$$I = I_w + I_b$$
The total inertia $I$ is fixed, so minimizing the within-cluster inertia $I_w$ amounts to maximizing the between-cluster inertia $I_b$:
$$\tilde{K} = \min_{K > 0} I_w$$
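A quick numerical check of the decomposition (a sketch assuming equally weighted observations, i.e. $p_k = m_k / m$):

import numpy as np

np.random.seed(1)
X = np.random.randn(100, 2)
c = np.random.randint(0, 3, 100)   # an arbitrary partition into K = 3 clusters
m = len(X)

g = X.mean(axis=0)                 # global centroid g
I = ((X - g) ** 2).sum() / m       # total inertia
Iw = sum(((X[c == k] - X[c == k].mean(axis=0)) ** 2).sum() / m
         for k in range(3))        # within-cluster inertia
Ib = sum((c == k).mean() * ((X[c == k].mean(axis=0) - g) ** 2).sum()
         for k in range(3))        # between-cluster inertia, weights p_k = m_k / m
print(np.isclose(I, Iw + Ib))      # True: I = Iw + Ib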
Scatterplot
[Figure: scatterplot of simulated data]
Total inertia
[Figure: illustration of the total inertia on the same data]
K-means algorithm
Input:
• K (number of clusters)
• Training set: $x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}$
K-means algorithm
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$
Repeat
  for $i = 1$ to $m$ (cluster assignment step)
    $c^{(i)} :=$ index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
  for $k = 1$ to $K$ (move centroid step)
    $$\mu_k = \frac{1}{\#\{i : c^{(i)} = k\}} \sum_{\{i : c^{(i)} = k\}} x^{(i)}$$
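A minimal numpy sketch of these two steps (illustrative only; it assumes no cluster becomes empty, which a robust implementation must handle):

import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # random initialization
    for _ in range(n_iter):
        # cluster assignment step: index of the closest centroid for each point
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d.argmin(axis=1)
        # move centroid step: mean of the points assigned to each cluster
        mu = np.array([X[c == k].mean(axis=0) for k in range(K)])
    return c, mu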
Cluster separability
[Figure: Height vs. Weight scatterplot with clearly separated clusters]
[Figure: Height vs. Weight scatterplot with less separated groups, labeled S, M and L]
$$J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \|x^{(i)} - \mu_{c^{(i)}}\|^2$$
Minimization problem:
$$\min_{c^{(1)}, \ldots, c^{(m)};\; \mu_1, \ldots, \mu_K} \frac{1}{m} \sum_{i=1}^{m} \|x^{(i)} - \mu_{c^{(i)}}\|^2$$
Repeated K-means
For $i = 1$ to 100:
  Randomly initialize K-means
  Run the K-means algorithm; get $c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K$
  Compute the cost function $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$
Finally, pick the clustering that gave the lowest cost $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$.
$$I_w = \sum_{i=1}^{m} \sum_{k=1}^{K} \|x^{(i)} - \mu_k\|^2 \, \mathbf{1}_{\{c^{(i)} = k\}}$$
K-means: illustration
[Figures: K-means iterations on the data, and a histogram of the cost values obtained over repeated runs]
Scatterplot
[Figure: successive iterations of K-means on the scatterplot data, one plot per step]
Extensions of K-means
• Pros
  – The algorithm reduces the within-cluster inertia at each step: it converges
  – Few iterations needed
• Cons
  – Unstable: the partition obtained depends on the initialization, hence running K-means several times...
  – The number of clusters K is fixed by the user: simulations, principal component analysis...
K-means in Python
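A possible sketch with scikit-learn (assuming the library is available); note that the n_init argument re-runs the algorithm with different random initializations and keeps the solution with the lowest inertia, which corresponds to the repeated K-means scheme above:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(300, 2)
km = KMeans(n_clusters=3, n_init=100, random_state=0).fit(X)
print(km.labels_[:10])        # cluster assignments c(i)
print(km.cluster_centers_)    # centroids mu_k
print(km.inertia_)            # within-cluster sum of squares (the cost, up to the 1/m factor)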
Hierarchical clustering
[Figure: step-by-step agglomeration of eight points in the (x, y) plane, merging the closest points or groups one at a time]
• Single linkage
• Complete linkage
• Ward's method:
$$\Delta(A, B) = \frac{p_A \, p_B}{p_A + p_B} \, d^2(g_A, g_B)$$
[Figure: three scatterplots of the same data in the (x1, y1) plane, comparing the three linkage methods]
Dendrogram: example with the iris dataset
[Figure: cluster dendrogram of the 150 iris observations]
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

# Two small groups of four points each
x1 = np.random.normal(4, 1, 4)
x2 = np.random.normal(6, 1, 4)
y1 = np.random.normal(4, 1, 4)
y2 = np.random.normal(6, 1, 4)
plt.scatter(x1, y1)
plt.scatter(x2, y2)
plt.show()

# Stack the 8 points into an (8, 2) array and compute pairwise distances
x = np.append(x1, x2)
y = np.append(y1, y2)
dat = np.c_[x, y]
dist = pdist(dat, metric="euclidean")  # condensed distance matrix
print(squareform(dist).shape)          # (8, 8)
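Continuing from the condensed distance matrix dist above, a hedged sketch of the agglomeration itself with scipy (Ward's method here; "single" and "complete" are the other linkages discussed):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

Z = linkage(dist, method="ward")   # agglomerative clustering of the 8 points
dendrogram(Z)                      # plot the resulting hierarchy
plt.show()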
$$P(x_j = 1) = q, \qquad P(x_j = x) = \frac{1 - q}{l - 1} \quad \forall x \neq 1$$
The same is done for the other clusters. A good choice for $q$ would be 0.8 (high frequency).
$K = 4$, $p = 3$, $l = 4$.
The only difference with the previous model is that the levels are not uniformly distributed within each cluster.
Consider a parameter $p_0$ controlling the non-uniformity of the level distribution, for example $p_0 = 0.8$, and define the clusters as follows (see the sketch after this list):
• $C_1$: $x_1$ and $x_2$ have odd levels with $P(x_1 = 1) = P(x_2 = 1) = p_0$, $x_3$ is random
• $C_2$: $x_1$ has odd levels, $x_2$ has even levels, with $P(x_1 = 1) = P(x_2 = 2) = p_0$, $x_3$ is random
• $C_3$: $x_1$ has even levels, $x_3$ has odd levels, with $P(x_1 = 2) = P(x_3 = 1) = p_0$, $x_2$ is random
• $C_4$: $x_1$ and $x_3$ have even levels with $P(x_1 = 2) = P(x_3 = 2) = p_0$, $x_2$ is random
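A hedged numpy sketch of this generative scheme for cluster $C_1$ (reading "odd levels" as {1, 3} among the $l = 4$ levels; the sample size and the mass placed on level 3 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n, p0 = 100, 0.8
x1 = rng.choice([1, 3], size=n, p=[p0, 1 - p0])  # odd levels, P(x1 = 1) = p0
x2 = rng.choice([1, 3], size=n, p=[p0, 1 - p0])  # odd levels, P(x2 = 1) = p0
x3 = rng.integers(1, 5, size=n)                  # x3 random: uniform on {1, ..., 4}
C1 = np.c_[x1, x2, x3]                           # n observations from cluster C1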
$$CU(C) = \frac{1}{K} \sum_{k=1}^{K} P(C_k) \left[ \sum_j \sum_l P(f_j = v_{jl} \mid C_k)^2 - \sum_j \sum_l P(f_j = v_{jl})^2 \right]$$
$$ME = \min_{\sigma \in \Sigma} \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{\{y_i \neq \sigma(\hat{y}_i)\}}$$
This criterion empirically solves the label-switching problem that is typical of clustering tasks.
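A brute-force sketch of this criterion (enumerating all permutations of the labels, so only feasible for small K):

from itertools import permutations
import numpy as np

def misclassification_error(y, y_hat, K):
    # minimize the error rate over all relabelings sigma of the K clusters
    return min(np.mean(y != np.array(perm)[y_hat])
               for perm in permutations(range(K)))

y = np.array([0, 0, 1, 1, 2, 2])       # true classes
y_hat = np.array([2, 2, 0, 0, 1, 1])   # same partition with switched labels
print(misclassification_error(y, y_hat, 3))  # 0.0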
DBSCAN
DBSCAN: illustration
Figure 3: Illustration of DBSCAN (Wikipedia). Red points represent a high-density region, the blue point represents a low-density region, and yellow points represent the “frontiers” of their cluster.
DBSCAN in Python
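A possible sketch with scikit-learn (eps and min_samples are illustrative values to be tuned on real data):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.randn(200, 2)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_)   # cluster labels; -1 marks low-density (noise) points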
$$R(t) = \alpha_t \, \operatorname{tr}(\operatorname{cov}(X_t))$$
The best split of $t$ is defined by the pair $(j, a) \in \{1, \ldots, p\} \times \mathbb{R}$ maximizing the split criterion, where $\forall \delta \in [0, 1]$:
$$\bar{d}_L^{\delta} = \frac{1}{\delta n_L} \sum_{i=1}^{\delta n_L} d_i \qquad \text{and} \qquad \bar{d}_R^{\delta} = \frac{1}{\delta n_R} \sum_{j=1}^{\delta n_R} d_j$$
Let $N_L$ be the total number of leaves and $K$ the expected number of classes. $\forall (L, R) \in \{1, \ldots, N_L\}^2$ with $L \neq R$, we have $(\tilde{L}, \tilde{R}) = \operatorname{argmin}_{L,R} \Delta(t_L, t_R)$.
Pros:
• Decisional method
• Interpretable clustering
• Extensions to other types of data (ordinal, nominal)
• Adapted to parallel computing
• Partition of the feature space, not only the training dataset
Cons:
• Same as CART
• Trees are unstable
Motivation
• Feature selection
• Dimension reduction
• Missing data
Objectives
• Define variable importance in CUBT
• Analyze its stability
• Compare to other methods
Competitive splits
To compute the importance of a feature $j$, we define the competitive split of a feature $j_0$ in a node $t$.
The probability that an observation is sent to the left node for both splits is
$$p(t_L \cap t'_L) = \frac{\#\{t_L \cap t'_L\}}{n_t}, \qquad p_{LL}(s, s_j) = \frac{p(t_L \cap t'_L)}{p(t)}$$
$$\operatorname{Imp}(X_j) = \sum_{t} \Delta(R(\tilde{s}_j, t))$$
Conclusion
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} \left( p(\theta_d) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
[Plate diagram: α → θ_d → z_{d,n} → w_{d,n} ← β_k ← δ, with plates N, D and K]
Figure 4: Traditional path diagram to illustrate LDA (inspired by Blei et al.). Arrows represent the conditional probabilities used in the generative process. Rectangles represent the replications of the process. The blue node corresponds to the observed variables (words).
2. Recent approaches in clustering
2.4. Topic modelling using Latent Dirichlet Allocation (LDA)
• The variables that will be interesting for interpreting the results are:
  – $\beta_k$, the vector of word probabilities for topic $k$
  – $\theta_{dk}$, the proportion of topic $k$ in document $d$
• The generative process uses two usual probability distributions (check the corresponding functions in numpy.random), as sketched below:
  – the multinomial distribution
  – the Dirichlet distribution
• Parameter estimation is based on Gibbs sampling
At each iteration of the algorithm, we get updated values of $\beta_k$ and $\theta_{dk}$.
The number of iterations (passes) is chosen by the user.
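A hedged numpy sketch of the generative process for a single document (vocabulary size, number of topics and prior values are illustrative; drawing each token with choice amounts to one multinomial trial):

import numpy as np

rng = np.random.default_rng(0)
K, W, N = 3, 20, 50                               # topics, vocabulary size, tokens
beta = rng.dirichlet(0.1 * np.ones(W), size=K)    # word probabilities per topic
theta = rng.dirichlet(0.5 * np.ones(K))           # topic proportions of the document
z = rng.choice(K, size=N, p=theta)                # topic assignment of each token
w = np.array([rng.choice(W, p=beta[zn]) for zn in z])  # observed words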
Some notations:
• $z_i$ is the topic assigned to token $i$ in the corpus
• $d_i$ is the document containing token $i$
• $w_i$ is the observed token (word)
• $z_{-i}$ denotes the topics assigned to all other tokens
Then we have:
$$P(z_i = j \mid z_{-i}, w_i, d_i, \alpha, \delta) \propto \frac{C^{WT}_{w_i j} + \delta}{\sum_{w=1}^{W} C^{WT}_{wj} + W\delta} \cdot \frac{C^{DT}_{d_i j} + \alpha}{\sum_{t=1}^{T} C^{DT}_{d_i t} + T\alpha}$$
where $C^{WT}_{wj}$ counts how many times word $w$ is assigned to topic $j$, and $C^{DT}_{dt}$ counts the tokens of document $d$ assigned to topic $t$.
$$\beta_{ik} = \frac{C^{WT}_{ik} + \delta}{\sum_{w=1}^{W} C^{WT}_{wk} + W\delta}$$
$$\theta_{dj} = \frac{C^{DT}_{dj} + \alpha}{\sum_{t=1}^{T} C^{DT}_{dt} + T\alpha}$$
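In practice these updates are handled by libraries. A hedged sketch with gensim (assuming it is installed; the passes argument is the number of iterations mentioned above, and the corpus is a toy example):

from gensim import corpora, models

texts = [["economy", "growth", "market"],
         ["neural", "network", "training"],
         ["market", "price", "demand"]]
dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]        # bag-of-words counts
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.show_topics())                    # word probabilities beta_k per topic
print(lda.get_document_topics(corpus[0]))   # topic proportions theta_d of document 0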