
First steps in deep learning

From the perceptron to deep generative models

Ferdinand Bhavsar, Nicolas Desassis, Fabien Ors, Thomas Romary

nicolas.desassis@minesparis.psl.eu

January 2022
Realistic simulations of meandering systems: Flumy

- Flumy is a process-based model for meandering channels in fluvial environments
- It provides very realistic simulations
- High computational cost
- Parameterization is not obvious (black box)
- How to perform conditional simulations?
- Can we approximate Flumy with a geostatistical model?
- Generative models of deep learning?
Introduction

Deep learning is the part of machine learning dealing with neural networks.

A neural network (NN) is a function f_W from R^n to (a subset of) R^p, parameterized by a large number of parameters (the weights, denoted W).

In general, NNs have a high capacity, or a great expressive power, because the number of weights is large (see later).

Thanks to generic (gradient-based) algorithms, NNs can help solve problems expressed as the optimization of a functional:

$$\operatorname*{argmin}_{f \in E} T(f) \simeq \operatorname*{argmin}_{W \in \mathbb{R}^m} T(f_W)$$

where T is a functional defined over a space E and f_W is a neural network parameterized with m weights stored in W.
Examples

- Classification
- Compression (or dimension reduction)
  Find an encoder f_e which reduces the dimension of the input (e.g. an image) and the associated decoder f_d to transform it back. The composition f_d ∘ f_e has to minimize the reconstruction error.
- Spatio-temporal prediction
  Example: weather forecasting from satellite images
- Partial differential equation resolution
- Find a random process to generate images with a distribution "close" to that of the training images
Some applications

Data sciences
- Predictive maintenance
- Click prediction
- Default payment prediction
- ...

Computer vision
- Medical diagnoses
- Facial recognition
- Handwriting recognition
- ...

Automatons
- Chess and Go
- ...

Natural language processing (NLP)
- Chatbots
- Automatic translation
- Spam detection
- ...

Generation of art (music, pictures, text, ...)
Some reasons for such a success

- Big data
- New algorithms
- Python
- Google & Facebook → TensorFlow, PyTorch
- New computational facilities → cloud, GPU, TPU, ...
- Open-source code
- Blogs
- MOOCs
- Impressive successes
Outline

- What is a neural network?
- How to train a neural network?
- Generative models
Classification and regression

N individuals (for instance N images). For each individual i, we know:
- a set of features $x_i = (x_{1,i}, \dots, x_{n,i})^T$
- a variable of interest $y_i \in \mathbb{R}^m$

If y_i is a categorical variable, y_i is called the label.

Aim
Predict the value of y for a new individual, knowing its features x.

- Regression: the components of y are continuous
- Classification: y is categorical
  - Binary classification: m = 1 and y ∈ {0, 1}
  - Multiclass classification: y ∈ {1, ..., K}
Matricial representation

Matrix of the features for the N individuals:

$$X = \begin{pmatrix} x_{1,1} & \dots & x_{1,N} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \dots & x_{n,N} \end{pmatrix}$$

Matrix of labels:

$$Y = \begin{pmatrix} y_{1,1} & \dots & y_{1,N} \\ \vdots & \ddots & \vdots \\ y_{m,1} & \dots & y_{m,N} \end{pmatrix}$$

When y is categorical with K categories (K > 2), we work with the indicators of each category (one-hot encoding):

$$y_{k,i} = \mathbb{1}_{y_i = k}$$

and m = K.
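As a minimal sketch of the one-hot encoding above, with illustrative labels (the values of y and K below are hypothetical):

```python
import numpy as np

# Hypothetical labels y_i in {1, ..., K} for N = 5 individuals
y = np.array([1, 3, 2, 3, 1])
K = 3

# One-hot matrix Y of shape (K, N): Y[k-1, i] = 1 if y_i = k, else 0
Y = (np.arange(1, K + 1)[:, None] == y[None, :]).astype(int)
```

Each column of Y contains a single 1, in the row of the corresponding category.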
Example: binary classification

[Diagram: input layer → hidden layer → output layer; the output 0.98 is read as "This is a cat"]

Find W such that all the f_W(x_i) of a training set are "close" to the associated labels y_i, where:
- x_i is the vector of features of the i-th image (pixel values)
- y_i is the label (cat or not cat)
Artificial neuron
A first model: McCulloch and Pitts (1943)

[Diagram: binary inputs x_1, ..., x_n, weighted by ω_1, ..., ω_n, feed a summation node s followed by a step function h]

$$s = \sum_{i=1}^{n} \omega_i x_i \qquad h(s) = \begin{cases} 1 & \text{if } s \ge \theta \\ 0 & \text{if } s < \theta \end{cases}$$

- Inputs x_i ∈ {0, 1}
- Weights ω_i
- Step function h
- Threshold θ
- Output h(s) ∈ {0, 1}
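The McCulloch-Pitts unit above fits in a few lines of Python; the inputs, weights and threshold in the example call are illustrative:

```python
import numpy as np

def mcculloch_pitts(x, w, theta):
    """McCulloch-Pitts neuron: weighted sum of binary inputs, then a step at threshold theta."""
    s = np.dot(w, x)
    return 1 if s >= theta else 0

# Illustrative call: two inputs, unit weights, threshold 2 (this computes AND)
out = mcculloch_pitts([1, 1], [1, 1], 2)
```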
Implementing logical gates

NOT

$$s = -x_1 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge -0.5 \\ 0 & \text{if } s < -0.5 \end{cases}$$

x_1 | h(s)
 0  |  1
 1  |  0

AND (θ = 1.2)

$$s = x_1 + x_2 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge 1.2 \\ 0 & \text{if } s < 1.2 \end{cases}$$

x_1 x_2 | h(s)
 0   0  |  0
 0   1  |  0
 1   0  |  0
 1   1  |  1

OR (θ = 0.8)

$$s = x_1 + x_2 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge 0.8 \\ 0 & \text{if } s < 0.8 \end{cases}$$

x_1 x_2 | h(s)
 0   0  |  0
 0   1  |  1
 1   0  |  1
 1   1  |  1

XOR?

$$s = \omega_1 x_1 + \omega_2 x_2 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge \theta \\ 0 & \text{if } s < \theta \end{cases}$$

x_1 x_2 | h(s)
 0   0  |  0
 0   1  |  1
 1   0  |  1
 1   1  |  0

No choice of ω_1, ω_2 and θ produces this table: the problem is not linearly separable.
Idea: combine neurons

$$x_1 \oplus x_2 = (x_1 \vee x_2) \wedge \neg(x_1 \wedge x_2)$$

[Diagram: x_1 and x_2 feed both an AND neuron and an OR neuron; a second-layer neuron combines their outputs to produce XOR]

x_1 x_2 | h(s)
 0   0  |  0
 0   1  |  1
 1   0  |  1
 1   1  |  0
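The decomposition above can be checked directly by composing threshold gates, reusing the thresholds from the gate slides (1.2 for AND, 0.8 for OR, −0.5 for NOT):

```python
def step(s, theta):
    # Heaviside step with threshold theta
    return 1 if s >= theta else 0

def AND(x1, x2):
    return step(x1 + x2, 1.2)

def OR(x1, x2):
    return step(x1 + x2, 0.8)

def NOT(x1):
    return step(-x1, -0.5)

def XOR(x1, x2):
    # x1 ⊕ x2 = (x1 ∨ x2) ∧ ¬(x1 ∧ x2)
    return AND(OR(x1, x2), NOT(AND(x1, x2)))
```

Running XOR on the four input pairs reproduces the truth table of the slide.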
A first theorem on the capacity

For every integer n, one can build a graph which contains all the functions from {0, 1}^n to {0, 1}.

Unfortunately, the number of necessary neurons in the intermediate layer grows exponentially with n.
First steps: the perceptron

Rosenblatt, F. (1957) The perceptron. A perceiving and recognizing automaton. Cornell Aeronautical Laboratory, Inc.
The perceptron

[Diagram: inputs 1, x_1, ..., x_n, weighted by ω_0 (bias), ω_1, ..., ω_n, feed a summation node s followed by the step function h]

With $\tilde{\omega} = (\omega_0, \omega_1, \dots, \omega_n)^T$ and $\tilde{x} = (1, x_1, \dots, x_n)^T$:

$$s = \omega_0 \cdot 1 + \sum_{i=1}^{n} \omega_i x_i = \langle \tilde{\omega}, \tilde{x} \rangle \qquad h(s) = \begin{cases} 1 & \text{if } s \ge 0 \\ 0 & \text{if } s < 0 \end{cases}$$

- Inputs x_i ∈ R
- Bias ω_0
- Weights ω_i
- Heaviside step function h
- Output h(s) ∈ {0, 1}
Perceptron algorithm

Choose an initial weight vector ω̃
continue ← true
while continue do
    continue ← false
    for i from 1 to N do
        if (⟨ω̃, x̃_i⟩ ≥ 0 and y_i = 0) or (⟨ω̃, x̃_i⟩ < 0 and y_i = 1) then
            ω̃ ← ω̃ + (y_i − 1_{⟨ω̃, x̃_i⟩ ≥ 0}) x̃_i
            continue ← true
        end if
    end for
end while
Return ω̃

This algorithm converges in a finite number of iterations... if a solution exists!
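A direct NumPy transcription of the pseudocode above, with a `max_epochs` safety cap added since convergence is only guaranteed for separable data; the AND truth table serves as an illustrative, linearly separable training set:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron algorithm on the augmented input x̃ = (1, x).

    X: array of shape (N, n); y: array of 0/1 labels of shape (N,).
    Returns the weight vector ω̃ of shape (n + 1,)."""
    N, n = X.shape
    Xt = np.hstack([np.ones((N, 1)), X])   # stack a 1 in front of each input
    w = np.zeros(n + 1)                    # initial weight vector
    for _ in range(max_epochs):
        updated = False
        for i in range(N):
            pred = 1 if np.dot(w, Xt[i]) >= 0 else 0
            if pred != y[i]:
                # update rule: ω̃ ← ω̃ + (y_i − 1_{⟨ω̃, x̃_i⟩ ≥ 0}) x̃_i
                w += (y[i] - pred) * Xt[i]
                updated = True
        if not updated:   # converged: every example correctly classified
            break
    return w

# Linearly separable toy data: the AND truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron(X, y)
```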
Implementation

The method is only adapted to linearly separable problems.
Regression for binary outcomes
Probit model (Bliss and Fisher, 1935) and logistic regression (Berkson, 1944)

y_i ∈ {0, 1} is a realization of a Bernoulli distribution with probability

$$p_i(\tilde{\omega}) = h(\langle \tilde{\omega}, \tilde{x}_i \rangle)$$

for a link function h.

[Figure: observed binary responses (0/1) plotted against chloroquine dose, with the fitted probability curve]

Probit model: h = G, the c.d.f. of the Gaussian distribution

Logistic model:

$$h(x) = \frac{1}{1 + e^{-x}}$$
[Perceptron diagram repeated, with s = ⟨ω̃, x̃⟩: replacing the Heaviside step by a smooth activation function h (e.g. the sigmoid) turns the output h(s) ∈ {0, 1} into a probability h(s) ∈ (0, 1).]
Inference by maximum likelihood

The likelihood of observation i is

$$P(Y = y_i \mid x_i) = p_i(\tilde{\omega})^{y_i} (1 - p_i(\tilde{\omega}))^{1 - y_i}$$

For N independent observations,

$$\mathrm{Likelihood}(\tilde{\omega}; y_1, \dots, y_N) = \prod_{i=1}^{N} p_i(\tilde{\omega})^{y_i} (1 - p_i(\tilde{\omega}))^{1 - y_i}$$

It is equivalent to maximize the log-likelihood

$$\sum_{i=1}^{N} \left[ y_i \log p_i(\tilde{\omega}) + (1 - y_i) \log(1 - p_i(\tilde{\omega})) \right]$$

or to minimize the binary cross-entropy

$$L(\tilde{\omega}) = -\sum_{i=1}^{N} \left[ y_i \log p_i(\tilde{\omega}) + (1 - y_i) \log(1 - p_i(\tilde{\omega})) \right]$$
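A minimal sketch of this binary cross-entropy for the logistic model; the data X, y and the weight vectors below are illustrative values, not taken from the course:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def binary_cross_entropy(w, X, y):
    """L(ω̃) for the logistic model, with augmented inputs x̃ = (1, x)."""
    Xt = np.hstack([np.ones((len(X), 1)), X])
    p = sigmoid(Xt @ w)                 # p_i(ω̃) = h(⟨ω̃, x̃_i⟩)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative one-feature data set
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0, 0, 1])
loss = binary_cross_entropy(np.array([-1.5, 1.0]), X, y)
```

Weights pointing in the wrong direction give a larger loss, which is what maximum likelihood penalizes.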
Neural network
Multi-Layer Perceptron

A neural network is an oriented graph in which the outputs of some neurons (nodes) become the inputs of other neurons.
One layer

Layer l maps the input vector v^{(l-1)} ∈ R^{n_{l-1}} to the output vector v^{(l)} ∈ R^{n_l}.

3 steps:
1) Stack a 1 on top of the input: $\tilde{v}^{(l-1)} = (1, v_1^{(l-1)}, \dots, v_{n_{l-1}}^{(l-1)})^T$
2) Compute $s^{(l)} = \tilde{W}^{(l)} \tilde{v}^{(l-1)}$
3) Return $v^{(l)} = h^{(l)}(s^{(l)})$

Componentwise, for i = 1, ..., n_l:

$$s_i^{(l)} = \omega_{i,0}^{(l)} \cdot 1 + \sum_{j=1}^{n_{l-1}} \omega_{i,j}^{(l)} v_j^{(l-1)}$$

In matrix form:

$$\underbrace{\begin{pmatrix} s_1^{(l)} \\ s_2^{(l)} \\ \vdots \\ s_{n_l}^{(l)} \end{pmatrix}}_{s^{(l)}} = \underbrace{\begin{pmatrix} \omega_{1,0}^{(l)} & \omega_{1,1}^{(l)} & \omega_{1,2}^{(l)} & \cdots & \omega_{1,n_{l-1}}^{(l)} \\ \omega_{2,0}^{(l)} & \omega_{2,1}^{(l)} & \omega_{2,2}^{(l)} & \cdots & \omega_{2,n_{l-1}}^{(l)} \\ \vdots & \vdots & \vdots & & \vdots \\ \omega_{n_l,0}^{(l)} & \omega_{n_l,1}^{(l)} & \omega_{n_l,2}^{(l)} & \cdots & \omega_{n_l,n_{l-1}}^{(l)} \end{pmatrix}}_{\tilde{W}^{(l)}} \underbrace{\begin{pmatrix} 1 \\ v_1^{(l-1)} \\ v_2^{(l-1)} \\ \vdots \\ v_{n_{l-1}}^{(l-1)} \end{pmatrix}}_{\tilde{v}^{(l-1)}}$$

and the output is $v^{(l)} = h^{(l)}(s^{(l)})$, applying h^{(l)} componentwise.

2 attributes for layer l:
1) A weight matrix $\tilde{W}^{(l)}$
2) An activation function $h^{(l)}$
Summary

The l-th layer of perceptrons is a function $f^{(l)}_{\tilde{W}^{(l)}}$ from R^{n_{l-1}} to R^{n_l} defined by:

$$v^{(l)} = f^{(l)}_{\tilde{W}^{(l)}}(v^{(l-1)}) = h^{(l)}(\tilde{W}^{(l)} \tilde{v}^{(l-1)}) = h^{(l)}(b^{(l)} + W^{(l)} v^{(l-1)})$$

with

$$W^{(l)} = \begin{pmatrix} \omega_{1,1}^{(l)} & \cdots & \omega_{1,n_{l-1}}^{(l)} \\ \vdots & \ddots & \vdots \\ \omega_{n_l,1}^{(l)} & \cdots & \omega_{n_l,n_{l-1}}^{(l)} \end{pmatrix} \quad \text{and} \quad b^{(l)} = \begin{pmatrix} \omega_{1,0}^{(l)} \\ \vdots \\ \omega_{n_l,0}^{(l)} \end{pmatrix}$$
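The three steps of one layer can be sketched directly in NumPy; the weight matrix and activation below are illustrative choices:

```python
import numpy as np

def layer_forward(W_tilde, h, v_prev):
    """One layer: stack a 1, multiply by W̃^{(l)}, apply h^{(l)} componentwise."""
    v_tilde = np.concatenate([[1.0], v_prev])  # step 1: stack a 1
    s = W_tilde @ v_tilde                      # step 2: s = W̃ ṽ
    return h(s)                                # step 3: v = h(s)

# Illustrative layer with n_{l-1} = 2 inputs and n_l = 3 outputs (hypothetical weights);
# the first column of W̃ holds the biases ω_{i,0}
W_tilde = np.array([[ 0.1, 1.0, -1.0],
                    [ 0.0, 0.5,  0.5],
                    [-0.2, 2.0,  1.0]])
relu = lambda s: np.maximum(s, 0.0)
v = layer_forward(W_tilde, relu, np.array([1.0, 2.0]))
```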
Multi-Layer Perceptron

Feed the input vector x to the input layer:

$$v^{(0)} = x$$

For l = 1, ..., L compute

$$v^{(l)} = f^{(l)}_{\tilde{W}^{(l)}}(v^{(l-1)})$$

Return the content of the output layer:

$$v^{(L)} = f_W(x)$$

v^{(1)}, ..., v^{(L-1)} are the hidden layers.
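Chaining layers gives the full forward pass; the shapes and random weights below are illustrative (a 2 → 3 → 1 network):

```python
import numpy as np

def mlp_forward(x, weight_matrices, activations):
    """Forward pass: v^{(0)} = x, then v^{(l)} = h^{(l)}(W̃^{(l)} ṽ^{(l-1)}); returns v^{(L)}."""
    v = x
    for W_tilde, h in zip(weight_matrices, activations):
        v = h(W_tilde @ np.concatenate([[1.0], v]))
    return v

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
identity = lambda s: s

# Hypothetical weights: W̃^{(l)} has shape (n_l, n_{l-1} + 1) because of the bias column
W1 = np.random.default_rng(0).normal(size=(3, 3))   # R^2 → R^3
W2 = np.random.default_rng(1).normal(size=(1, 4))   # R^3 → R^1
out = mlp_forward(np.array([0.5, -0.5]), [W1, W2], [sigmoid, identity])
```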
Capacity of a neural network
Universal approximation theorem (Hornik, 1991)

Let f : [0, 1]^n → R be a continuous function and ε > 0.
Let h^{(1)} be a non-constant, increasing, bounded real function.
Then, there exist
- an integer q,
- a matrix of weights W̃^{(1)} ∈ R^{q×n},
- a vector of weights ω^{(2)} ∈ R^q,
- a real b^{(2)},
such that for all x ∈ [0, 1]^n,

$$\left| f(x) - b^{(2)} - \left\langle \omega^{(2)}, h^{(1)}\!\left( \langle \tilde{W}^{(1)}, \tilde{x} \rangle \right) \right\rangle \right| < \varepsilon$$

Any continuous real-valued function on a bounded subset of R^n can be approximated arbitrarily well by an MLP with one hidden layer (for an appropriate activation function h^{(1)}).
Loss functions

The cost function is a sum of loss functions δ between each prediction f_W(x_i) and each label y_i:

$$L(W) = \sum_{i=1}^{N} \delta(y_i, f_W(x_i))$$

Regression

$$\delta(y, f_W(x)) = \sum_{k=1}^{m} (y_k - f_W(x)_k)^2$$

- Sum of squares
- Likelihood for Gaussian noise
- Simplifications occur when the activation function of the output layer is the identity

Binary classification

$$\delta(y, f_W(x)) = -y \log(f_W(x)) - (1 - y) \log(1 - f_W(x))$$

- Binary cross-entropy
- Likelihood for Bernoulli
- Simplifications occur when the activation function of the output layer is the sigmoid (a.k.a. logistic)

Multiclass classification

$$\delta(y, f_W(x)) = -\sum_{k=1}^{K} y_k \log(f_W(x)_k)$$

- Cross-entropy
- Likelihood for the multinomial distribution
- Simplifications occur when the activation function of the output layer is the softmax
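The three loss functions above, written per example; the arrays in the comments are purely illustrative:

```python
import numpy as np

def sum_of_squares(y, pred):
    # Regression: δ(y, f_W(x)) = Σ_k (y_k − f_W(x)_k)²
    return np.sum((y - pred) ** 2)

def binary_cross_entropy(y, p):
    # Binary classification: δ = −y log p − (1 − y) log(1 − p), with p = f_W(x) ∈ (0, 1)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def cross_entropy(y_onehot, probs):
    # Multiclass: δ = −Σ_k y_k log f_W(x)_k, with y one-hot and probs on the simplex
    return -np.sum(y_onehot * np.log(probs))
```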
Softmax activation function

For a multiclass problem with K categories, one can use the softmax activation on the last layer:

$$v = \mathrm{Softmax}(s) = \frac{1}{\sum_{i=1}^{n_L} e^{s_i}} \begin{pmatrix} e^{s_1} \\ \vdots \\ e^{s_{n_L}} \end{pmatrix}$$

The result is an element of the simplex of dimension n_L:

$$\sum_{i=1}^{n_L} v_i = 1 \quad \text{and} \quad v_i > 0$$
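A small sketch of the softmax; subtracting max(s) is a standard numerical trick (not from the slide) that leaves the result unchanged while avoiding overflow:

```python
import numpy as np

def softmax(s):
    """Softmax activation: maps s ∈ R^{n_L} onto the simplex."""
    e = np.exp(s - np.max(s))   # shift for numerical stability; cancels in the ratio
    return e / np.sum(e)

v = softmax(np.array([1.0, 2.0, 3.0]))
```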
Training a neural network: gradient descent

The gradient:

$$\nabla L(\tilde{\omega}) = \begin{pmatrix} \frac{dL}{d\omega_0}(\tilde{\omega}) \\ \vdots \\ \frac{dL}{d\omega_n}(\tilde{\omega}) \end{pmatrix}$$

Initialisation: choose $\tilde{\omega}^{(0)} \in \mathbb{R}^{n+1}$

Iterate until convergence:

$$\tilde{\omega}^{(t+1)} = \tilde{\omega}^{(t)} - \frac{\eta}{N} \nabla L\left(\tilde{\omega}^{(t)}\right)$$

η > 0 is the learning rate, chosen by the user.

Credits: Martin Thoma
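The iteration above on a toy objective; the quadratic L(ω) = ||ω − ω*||² and its target ω* are illustrative, and the 1/N factor of the slide is dropped for this single-objective sketch:

```python
import numpy as np

# Minimise L(ω) = ||ω − ω*||², whose gradient is 2(ω − ω*)
target = np.array([3.0, -2.0])      # hypothetical optimum ω*
grad = lambda w: 2.0 * (w - target)

w = np.zeros(2)     # initialisation ω̃^{(0)}
eta = 0.1           # learning rate chosen by the user
for _ in range(200):
    w = w - eta * grad(w)   # ω̃^{(t+1)} = ω̃^{(t)} − η ∇L(ω̃^{(t)})
```

Each step contracts the distance to the optimum by the factor |1 − 2η|.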
Training a neural network: back-propagation

Chain rule:

$$(f \circ g)'(x) = f'(g(x))\, g'(x)$$

$$(f \circ g \circ h)'(x) = (f \circ g)'(h(x))\, h'(x) = f'(g(h(x)))\, g'(h(x))\, h'(x)$$

Leibniz notation:

$$x \xrightarrow{h} y \xrightarrow{g} u \xrightarrow{f} z$$

$$\frac{dz}{dx} = \frac{dz}{du} \frac{du}{dy} \frac{dy}{dx}$$
Back-propagation

$$\frac{dL}{d\omega_{i,j}^{(L)}} = \frac{dL}{dv^{(L)}} \frac{dv^{(L)}}{ds^{(L)}} \frac{ds^{(L)}}{d\omega_{i,j}^{(L)}}$$

$$\frac{dL}{d\omega_{i,j}^{(L-1)}} = \frac{dL}{dv^{(L)}} \frac{dv^{(L)}}{ds^{(L)}} \frac{ds^{(L)}}{dv^{(L-1)}} \frac{dv^{(L-1)}}{ds^{(L-1)}} \frac{ds^{(L-1)}}{d\omega_{i,j}^{(L-1)}}$$

$$\frac{dL}{d\omega_{i,j}^{(L-2)}} = \underbrace{\underbrace{\underbrace{\frac{dL}{dv^{(L)}} \frac{dv^{(L)}}{ds^{(L)}}}_{E^{(L)}} \frac{ds^{(L)}}{dv^{(L-1)}} \frac{dv^{(L-1)}}{ds^{(L-1)}}}_{E^{(L-1)}} \frac{ds^{(L-1)}}{dv^{(L-2)}} \frac{dv^{(L-2)}}{ds^{(L-2)}}}_{E^{(L-2)}} \frac{ds^{(L-2)}}{d\omega_{i,j}^{(L-2)}}$$

Each error term E^{(l)} is obtained from E^{(l+1)} by two extra factors, so the gradients are computed once per layer, propagating from the output back to the input.
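A compact sketch of these recursions for a two-layer sigmoid network with a sum-of-squares loss; the architecture, weights and data are illustrative, and the E^{(l)} terms appear as the variables `e2` and `e1`:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def forward(W1, W2, x):
    # Two sigmoid layers; keep intermediate activations for the backward pass
    v1 = sigmoid(W1 @ np.concatenate([[1.0], x]))
    v2 = sigmoid(W2 @ np.concatenate([[1.0], v1]))
    return v1, v2

def loss(W1, W2, x, y):
    _, v2 = forward(W1, W2, x)
    return float(np.sum((v2 - y) ** 2))

def gradients(W1, W2, x, y):
    """Back-propagation: accumulate dL/ds^{(l)} (the E^{(l)} terms) layer by layer."""
    v1, v2 = forward(W1, W2, x)
    e2 = 2 * (v2 - y) * v2 * (1 - v2)                # dL/ds^{(2)}; sigmoid' = v(1 − v)
    g2 = np.outer(e2, np.concatenate([[1.0], v1]))   # dL/dW̃^{(2)}
    e1 = (W2[:, 1:].T @ e2) * v1 * (1 - v1)          # propagate backwards (skip bias column)
    g1 = np.outer(e1, np.concatenate([[1.0], x]))    # dL/dW̃^{(1)}
    return g1, g2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(1, 4))
x, y = np.array([0.5, -1.0]), np.array([1.0])
g1, g2 = gradients(W1, W2, x, y)
```

A finite-difference check on a single weight confirms the analytic gradient.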
Stochastic gradient and variants
Reading group - Mike Pereira

- Evaluating the gradient of the loss function can be costly when the training set is large
- Instead of computing the gradient on the entire training set, it is computed on mini-batches of moderate size n_b (e.g. n_b = 20, but it can be 1):

$$W^{(t+1)} = W^{(t)} - \eta \nabla L_{n_b}\left(W^{(t)}\right)$$

- The number of epochs is the number of times the entire data set is visited
- At the beginning of each epoch, the data set is randomly shuffled
- The cost function and its gradient are divided by n_b (to make the choice of η independent of n_b)
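The epoch/shuffle/mini-batch loop above can be sketched as follows; the least-squares problem and all numeric values are illustrative:

```python
import numpy as np

def sgd(w, X, y, grad_fn, eta=0.1, n_b=20, n_epochs=10, seed=0):
    """Mini-batch SGD: shuffle each epoch, average the gradient over n_b examples."""
    rng = np.random.default_rng(seed)
    N = len(X)
    for _ in range(n_epochs):                 # one epoch = one pass over the data
        perm = rng.permutation(N)             # shuffle at the beginning of each epoch
        for start in range(0, N, n_b):
            idx = perm[start:start + n_b]
            g = grad_fn(w, X[idx], y[idx]) / len(idx)   # divide by n_b
            w = w - eta * g
    return w

# Toy least-squares problem y = X w* (hypothetical data)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([1.0, -2.0])
y = X @ true_w
grad_fn = lambda w, Xb, yb: 2 * Xb.T @ (Xb @ w - yb)
w = sgd(np.zeros(2), X, y, grad_fn, eta=0.05, n_b=20, n_epochs=50)
```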
Adaptive Moment Estimation (ADAM)
Kingma and Ba (2014)

Initialisation:

$$m^{(0)} = v^{(0)} = 0, \qquad \beta_1, \beta_2 \in [0, 1) \quad \text{e.g. } \beta_1 = 0.9,\ \beta_2 = 0.999$$

For t ≥ 1:

$$m^{(t+1)} = \beta_1 m^{(t)} + (1 - \beta_1) \nabla L_{n_b}(W^{(t)})$$

$$v^{(t+1)} = \beta_2 v^{(t)} + (1 - \beta_2) \left( \nabla L_{n_b}(W^{(t)}) \right)^2 \quad \text{(componentwise square)}$$

$$\hat{m}^{(t+1)} = \frac{m^{(t+1)}}{1 - \beta_1^t} \qquad \hat{v}^{(t+1)} = \frac{v^{(t+1)}}{1 - \beta_2^t}$$

$$W^{(t+1)} = W^{(t)} - \eta \frac{\hat{m}^{(t+1)}}{\sqrt{\hat{v}^{(t+1)}} + \varepsilon}$$

[Tensorflow playground]
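The ADAM recursions transcribe directly; the quadratic objective and its target below are illustrative:

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update, following the slide's recursions."""
    m = beta1 * m + (1 - beta1) * g           # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # second-moment estimate (componentwise)
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimise L(w) = ||w − w*||² with ADAM (hypothetical target w*)
target = np.array([1.0, -1.0])
w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 5001):
    g = 2 * (w - target)
    w, m, v = adam_step(w, g, m, v, t, eta=0.01)
```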
Generative model: the problem

Aim
Given a (large?) set of training examples, generate new examples
which "look like" the training examples in terms of
fidelity or realism
diversity or variability

36/57
Generative model: the problem

Aim
Given a (large?) set of training examples, generate new examples
which "look like" the training examples in terms of
fidelity or realism
diversity or variability
In other words, given an empirical distribution PD over Rn, find a
probability distribution (the model) PG, together with its simulation
algorithm (the generator), that is close to PD (or to its theoretical
version P)

36/57
First idea: use a Gaussian transform approach
Generate a latent vector

z ∼ Nn (0, In )

Compute
x = G(z)
where G = fW is a neural network with weights W to
determine
Mathematical justifications:
Most distributions on Rn can be written as a function
of a Gaussian vector of Rn with independent components
(inversion method + decomposition of multivariate densities
into a product of conditional distributions)
By using a high-capacity neural network G, we can
approximate the target function (universal approximation
theorem)
37/57
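The inversion argument can be illustrated in one dimension (the exponential target below is an assumption for the sketch): pushing z through the Gaussian CDF gives a uniform variable, and the target's inverse CDF then turns it into a sample from the target distribution.

```python
# One-dimensional inversion-method sketch: x = F^{-1}(Phi(z)),
# with Phi the standard Gaussian CDF and F the CDF of Exp(1).
import numpy as np
from math import erf

# Standard Gaussian CDF, vectorized over arrays (math.erf is scalar-only)
phi = np.vectorize(lambda t: 0.5 * (1 + erf(t / 2**0.5)))

def exp_inv_cdf(u, lam=1.0):
    # Inverse CDF of the exponential distribution with rate lam
    return -np.log(1 - u) / lam

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)     # z ~ N(0, 1)
u = phi(z)                       # u ~ Uniform(0, 1)
x = exp_inv_cdf(u)               # x ~ Exp(1)

print(round(float(x.mean()), 2), round(float(x.var()), 2))
```

Both the empirical mean and variance come out close to 1, as expected for Exp(1). The generator G of the slides is trained to approximate this kind of transform in high dimension.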
Generator

[Diagram: latent space → hidden layer → output layer; NOISE is mapped to a FAKE image]
38/57
Example of geostatistical deep generative model
Cholesky algorithm

Gaussian vector x with mean vector µ and covariance matrix Σ


Compute L (lower triangular) such that

Σ = LLT

Sample
z ∼ Nn (0, In )
Return
x = µ + Lz

Transformation can be handled by the activation function

39/57
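The Cholesky algorithm above, sketched in NumPy on a toy 2×2 covariance (the values of µ and Σ are made up for the example):

```python
# Cholesky simulation: x = mu + L z with Sigma = L L^T.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

L = np.linalg.cholesky(Sigma)       # lower triangular, Sigma = L @ L.T
z = rng.normal(size=(2, 50_000))    # z ~ N_n(0, I_n), one column per draw
x = mu[:, None] + L @ z             # x ~ N(mu, Sigma)

print(np.round(x.mean(axis=1), 1))
print(np.round(np.cov(x), 1))
```

The empirical mean and covariance of the simulated columns match µ and Σ up to Monte-Carlo error.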
Problem: How to handle the large number of parameters?

The data belongs to a manifold

[https://line.17qq.com/articles/cmmcohgcpv.html]

40/57
Example - Dimension reduction (or compression)

Starting from a set of examples x1 , . . . , xN , find two functions


an encoder fe from Rn to Rd with d much smaller than n
a decoder fd from Rd to Rn
The aim is to minimize the reconstruction error over all the
examples
$$T(f_e, f_d) = \sum_{i=1}^{N} \| x_i - f_d(f_e(x_i)) \|^2$$

41/57
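A minimal linear autoencoder sketch, minimizing the reconstruction error T by gradient descent. The synthetic data (near a 2-D subspace of R^5), the network sizes and the training schedule are all assumptions for the example:

```python
# Linear autoencoder: encoder f_e(x) = A x, decoder f_d(c) = B c,
# trained on the mean reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
basis = rng.normal(size=(2, 5))
X = latent @ basis + 0.01 * rng.normal(size=(200, 5))  # examples x_1..x_N

n, d, N = 5, 2, len(X)
A = 0.1 * rng.normal(size=(d, n))   # encoder weights
B = 0.1 * rng.normal(size=(n, d))   # decoder weights
eta = 0.03
for _ in range(8000):
    C = X @ A.T                 # codes f_e(x_i)
    R = C @ B.T - X             # residuals f_d(f_e(x_i)) - x_i
    gB = 2 * R.T @ C / N        # gradient of the mean error w.r.t. B
    gA = 2 * (R @ B).T @ X / N  # gradient of the mean error w.r.t. A
    A -= eta * gA
    B -= eta * gB

err = np.sum((X @ A.T @ B.T - X) ** 2) / np.sum(X**2)
print(round(float(err), 3))     # relative reconstruction error
```

Since the data essentially lie in a 2-D subspace, a latent dimension d = 2 is enough to drive the relative reconstruction error close to the noise floor.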
Second idea: Dimension reduction

[https://www.compthree.com/blog/autoencoder/]

The network "learns" a low-dimensional representation of the inputs

42/57
Example: compression of Flumy images

Original 64 × 64 Flumy images (left) and reconstructed
versions (right) after compression into a latent space of 121
units

43/57
Simulate in the latent space

44/57
Generator

[Diagram: latent space → hidden layer → output layer; NOISE is mapped to a FAKE image]

How to train the network?

45/57
Generative Adversarial Network (GAN)
A two players game

[Diagram: the GENERATOR maps NOISE from the latent space through a hidden layer to a FAKE image; the DISCRIMINATOR takes a REAL or fake image through its input, hidden and output layers and answers "REAL or FAKE?"]

46/57
Generative Adversarial Networks (GAN)
Two adversarial neural networks
A generator X = Gθ(z), where z is a Gaussian noise N(0, I_d)
and θ are the weights
A discriminator Dϕ(X) which classifies images as "true" or
"fake" (ϕ are the weights)

Idea: train Gθ and Dϕ jointly, so that Gθ tries to mislead Dϕ while
Dϕ tries to become better at detecting fakes

Dϕ(X) is close to 1 if the discriminator classifies an image as true,
and close to 0 otherwise

Minmax game with value function

min_{Gθ} max_{Dϕ} V(Dϕ, Gθ)

with

V(Dϕ, Gθ) = E_{X∼Data}[log Dϕ(X)] + E_{z∼N(0,I_d)}[log(1 − Dϕ(Gθ(z)))]

47/57
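In practice the two expectations are estimated by Monte-Carlo averages over minibatches. A toy one-dimensional estimate of V(Dϕ, Gθ), where the data distribution, the identity "generator" and the fixed hand-made logistic discriminator are all assumptions for the sketch:

```python
# Monte-Carlo estimate of the GAN value function V(D, G)
# in a toy one-dimensional case.
import numpy as np

rng = np.random.default_rng(0)

def D(x, a=2.0, b=-2.0):
    # A fixed logistic discriminator: close to 1 on "true"-looking inputs
    return 1 / (1 + np.exp(-(a * x + b)))

G = lambda z: z                              # toy "generator": fakes ~ N(0, 1)

x_real = rng.normal(loc=2.0, size=100_000)   # "data" ~ N(2, 1)
z = rng.normal(size=100_000)                 # latent noise

V = np.mean(np.log(D(x_real))) + np.mean(np.log(1 - D(G(z))))
print(round(float(V), 2))
```

Both terms are averages of log-probabilities, so V is always negative; the better the discriminator separates data from fakes, the closer V gets to 0.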
48/57
Training
Goodfellow et al. (2014)

for number of training iterations do
  for k steps do
    Sample a minibatch of m_b noise samples {z_1, . . . , z_{m_b}} ∼ N(0, I_d)
    Sample a minibatch of m_b examples {x_1, . . . , x_{m_b}} ∼ Data
    Update the discriminator by ascending its stochastic gradient:

    ∇ϕ (1/m_b) Σ_{i=1}^{m_b} [log Dϕ(x_i) + log(1 − Dϕ(Gθ(z_i)))]

  end for
  Sample a minibatch of m_b noise samples {z_1, . . . , z_{m_b}} ∼ N(0, I_d)
  Update the generator by descending its stochastic gradient:

  ∇θ (1/m_b) Σ_{i=1}^{m_b} log(1 − Dϕ(Gθ(z_i)))

end for
49/57
Flumy generator

The generator is trained on 10,000 small 64 × 64 training images

The architecture of the generator (a fully convolutional neural
network) leads to a stationary process

Example of 64 × 64 Flumy generated images

50/57
Results
Horizontal section

Top: Flumy. Bottom: GAN

51/57
Example on Plurigaussian

4 images among 1000 training images

52/57
Local adaptation

53/57
Conditioning

Variational Bayesian formulation

A new latent vector Zc
A conditioning generator Gc

Zc → Gc(Zc) → X

Find the parameters of Gc such that G(Gc(Zc)) produces
outputs close to the data and consistent with the prior
(Gaussian distribution of the latent space)

54/57
Results

Reference image

9 conditional simulations

55/57
Results

Reference image

Conditional probabilities

56/57
References

Chapelle, O., J. Weston, L. Bottou, and V. Vapnik (2001). Vicinal risk minimization. Advances in neural
information processing systems, 416–422.
Glorot, X. and Y. Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In
Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256.
JMLR Workshop and Conference Proceedings.
Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014).
Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
He, K., X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 770–778.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks 4 (2), 251–257.
Ioffe, S. and C. Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal
covariate shift. In International conference on machine learning, pp. 448–456. PMLR.
Kingma, D. P. and J. Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, A., I. Sutskever, and G. E. Hinton (2012). Imagenet classification with deep convolutional neural
networks. Advances in neural information processing systems 25, 1097–1105.
LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE 86 (11), 2278–2324.
Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014). Dropout: A simple way to
prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), 1929–1958.
Szegedy, C., W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich
(2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 1–9.

57/57
