
Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville

Chapter 6: Deep Feedforward Networks

Benoit Massé, Dionyssos Kounades-Bastian



Linear regression (and classification)

Input vector x
Output vector y
Parameters: weight W and bias b

Prediction: y = W⊤x + b

[Diagram: inputs x1, x2, x3 mapped through u = W⊤x + b to outputs y1, y2]
Linear regression (and classification)

Advantages
Easy to use
Easy to train, low risk of overfitting

Drawbacks
Some problems are inherently non-linear



Solving XOR



Solving XOR

Linear regressor

[Diagram: inputs x1, x2 mapped to output y through the linear map W, b]

There is no value of W and b such that, ∀(x1, x2) ∈ {0, 1}²,

    W⊤ (x1, x2)⊤ + b = xor(x1, x2)



Solving XOR

What about...?

[Diagram: inputs x1, x2 mapped to hidden units u1, u2 via W, b, then to output y via V, c]

Strictly equivalent:
The composition of two linear operations is still a linear operation.


Solving XOR

And what about...?

[Diagram: inputs x1, x2 mapped to u1, u2 via W, b, then through φ to h1, h2, then to output y via V, c]

in which φ(x) = max{0, x}

It is possible!

With W = (1 1; 1 1), b = (0; −1), V = (1 −2) and c = 0,

    Vφ(Wx + b) = xor(x1, x2)
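As a quick sanity check (not part of the original slides), here is a minimal NumPy sketch that evaluates this network on all four inputs; the function names are our own.

```python
import numpy as np

# Parameters from the slide: one hidden layer with ReLU, then a linear output.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, -1.0])
V = np.array([1.0, -2.0])
c = 0.0

def relu(x):
    return np.maximum(0.0, x)

def f_nn(x):
    # V phi(W x + b) + c
    return V @ relu(W @ x + b) + c

for x1 in (0, 1):
    for x2 in (0, 1):
        y = f_nn(np.array([x1, x2], dtype=float))
        print(f"xor({x1}, {x2}) = {y:.0f}")   # prints 0, 1, 1, 0
```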
Neural network with one hidden layer

Compact representation

[Diagram: x mapped to hidden layer h via W, b, φ, then to output y via V, c]

Neural network
Hidden layer with non-linearity
→ can represent a broader class of functions



Universal approximation theorem

Theorem
A neural network with one hidden layer can approximate any
continuous function

More formally, given a continuous function f : Cn → Rᵐ, where Cn is a compact subset of Rⁿ,

    ∀ε > 0, ∃ fNN^ε : x ↦ Σᵢ₌₁ᴷ vᵢ φ(wᵢ⊤x + bᵢ) + c

such that

    ∀x ∈ Cn, ‖f(x) − fNN^ε(x)‖ < ε

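To make the parametric form of fNN^ε concrete, here is a small NumPy sketch of a one-hidden-layer network with K hidden units; the weights below are random placeholders, not the values whose existence the theorem guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, m = 64, 2, 1              # hidden units, input dim, output dim

# Random placeholder parameters; the theorem only says suitable values exist.
w = rng.normal(size=(K, n))     # one w_i per row
b = rng.normal(size=K)          # b_i
v = rng.normal(size=(m, K))     # v_i, stacked into a matrix
c = np.zeros(m)

def f_nn(x, phi=lambda z: np.maximum(0.0, z)):
    # sum_i v_i * phi(w_i^T x + b_i) + c
    return v @ phi(w @ x + b) + c

print(f_nn(np.array([0.3, -1.2])))
```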


Problems

Obtaining the network
The universal approximation theorem gives no information about HOW to obtain such a network:
the size of the hidden layer h
the values of W and b

Using the network
Even if we find a way to obtain the network, the size of the hidden layer may be prohibitively large.


Deep neural network

Why Deep?
Let's stack l hidden layers one after the other; l is called the depth of the network.

[Diagram: x mapped through hidden layers h1, . . . , hl (each via Wi, bi, φ), then to output y via V, c]

Properties of DNN
The universal approximation theorem also applies
Some functions can be approximated by a DNN with N hidden units but would require O(eᴺ) hidden units to be represented by a shallow network

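To illustrate the stacking, a minimal NumPy forward pass through l hidden layers; the layer sizes, random weights and ReLU choice below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def forward(x, Ws, bs, V, c, phi=lambda z: np.maximum(0.0, z)):
    """Compute y = V phi(Wl ... phi(W1 x + b1) ... + bl) + c."""
    h = x
    for W, b in zip(Ws, bs):      # one (W_i, b_i, phi) block per hidden layer
        h = phi(W @ h + b)
    return V @ h + c

rng = np.random.default_rng(0)
sizes = [3, 8, 8, 2]              # input dim, two hidden layers, output dim
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 2)]
bs = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 2)]
V = rng.normal(size=(sizes[-1], sizes[-2]))
c = np.zeros(sizes[-1])

print(forward(np.ones(3), Ws, bs, V, c))
```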




Summary
Comparison
Linear classifier
− Limited representational power

+ Simple

Shallow Neural network (Exactly one hidden layer)


+ Unlimited representational power

− Sometimes prohibitively wide

Deep Neural network


+ Unlimited representational power

+ Relatively small number of hidden units needed

Remaining problem
How to get this DNN?




On the path of getting my own DNN

Hyperparameters
First, we need to define the architecture of the DNN:
The depth l
The sizes of the hidden layers n1, . . . , nl
The activation function φ
The output unit

Parameters
Once the architecture is defined, we need to train the DNN:
W1, b1, . . . , Wl, bl




Hyperparameters

The depth l
The sizes of the hidden layers n1, . . . , nl
Strongly depend on the problem to solve

The activation function φ
ReLU: g : x ↦ max{0, x}
Sigmoid: σ : x ↦ 1 / (1 + e⁻ˣ)
Many others: tanh, RBF, softplus...

The output unit
Linear output: E[y] = V⊤hl + c
For regression with a Gaussian distribution y ∼ N(E[y], I)
Sigmoid output: ŷ = σ(w⊤hl + b)
For classification with a Bernoulli distribution P(y = 1 | x) = ŷ

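As an illustration (the names and shapes are our own assumptions), these activation functions and output units written in NumPy:

```python
import numpy as np

def relu(x):
    # g(x) = max{0, x}
    return np.maximum(0.0, x)

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def linear_output(h_l, V, c):
    # E[y] = V^T h_l + c, used for regression with Gaussian output noise
    return V.T @ h_l + c

def sigmoid_output(h_l, w, b):
    # y_hat = sigma(w^T h_l + b), used for binary (Bernoulli) classification
    return sigmoid(w.T @ h_l + b)
```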


Parameters Training

Objective
Let's define θ = (W1, b1, . . . , Wl, bl).
We suppose we have a set of inputs X = (x1, . . . , xN) and a set of expected outputs Y = (y1, . . . , yN). The goal is to find a neural network fNN such that

    ∀i, fNN(xi, θ) ≈ yi



Parameters Training

Cost function
To evaluate the error that our current network makes, let's define a cost function L(X, Y, θ). The goal becomes to find

    argmin_θ L(X, Y, θ)

Loss function
Should represent a combination of the distances between every yi and the corresponding fNN(xi, θ):
Mean squared error (rare)
Cross-entropy

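As a sketch (not from the slides), the two losses mentioned above written in NumPy and averaged over the N training pairs; the cross-entropy form assumes sigmoid outputs ŷ = P(y = 1 | x).

```python
import numpy as np

def mean_squared_error(Y_true, Y_pred):
    # 1/N * sum_i ||y_i - f_NN(x_i, theta)||^2
    return np.mean(np.sum((Y_true - Y_pred) ** 2, axis=-1))

def cross_entropy(Y_true, Y_pred, eps=1e-12):
    # Binary cross-entropy for Bernoulli outputs y_hat = P(y=1|x):
    # -1/N * sum_i [ y_i log y_hat_i + (1 - y_i) log(1 - y_hat_i) ]
    Y_pred = np.clip(Y_pred, eps, 1.0 - eps)
    return -np.mean(Y_true * np.log(Y_pred) + (1 - Y_true) * np.log(1 - Y_pred))
```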


Parameters Training

Find the minimum

The basic idea consists in computing θ̂ such that

    ∇θ L(X, Y, θ̂) = 0.

This is difficult to solve analytically, e.g. when θ has millions of degrees of freedom.



Parameters Training

Gradient descent
Let's use a numerical way to optimize θ, called gradient descent (section 4.3). The idea is that

    f(θ − εu) ≈ f(θ) − ε u⊤∇f(θ)

So if we take u = ∇f(θ), we have u⊤u > 0 and then

    f(θ − εu) ≈ f(θ) − ε u⊤u < f(θ).

If f is a function to minimize, we have an update rule that improves our estimate.




Parameters Training

Gradient descent algorithm

1. Have an estimate θ̂ of the parameters
2. Compute ∇θ L(X, Y, θ̂)
3. Update θ̂ ← θ̂ − ε∇θ L
4. Repeat steps 2-3 until ‖∇θ L‖ < threshold

Problem
How to estimate ∇θ L(X, Y, θ̂) efficiently?
→ Back-propagation algorithm

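A minimal sketch of this loop (the loss, its gradient, the step size ε and the threshold below are placeholders of our own choosing):

```python
import numpy as np

def gradient_descent(grad_L, theta0, eps=0.1, threshold=1e-6, max_iters=10_000):
    """Generic gradient descent: theta <- theta - eps * grad_L(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_L(theta)
        if np.linalg.norm(g) < threshold:   # stop once the gradient is small
            break
        theta = theta - eps * g
    return theta

# Toy example: minimize L(theta) = ||theta - 3||^2, whose gradient is 2(theta - 3).
theta_hat = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=[0.0, 0.0])
print(theta_hat)    # approximately [3, 3]
```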


Back-propagation for Parameter Learning

Consider the architecture:

[Diagram: x passed through a unit with weight w1 and activation φ, then through a unit with weight w2 and activation φ, producing y]

with function

    y = φ(w2 φ(w1 x)),

some training pairs T = {(x̂n, ŷn)}, n = 1, . . . , N, and
an activation function φ().

Learn w1, w2 so that feeding x̂n yields ŷn.



Prerequisite: differentiable activation function

For learning to be possible, φ() has to be differentiable.

Let φ'(x) = ∂φ(x)/∂x denote the derivative of φ(x).

For example, when φ(x) = ReLU(x) we have φ'(x) = 1 for x > 0 and φ'(x) = 0 for x < 0.


Gradient-based Learning

Minimize the loss function L(w1, w2, T).

We will learn the weights by iterating:

    (w1, w2)_updated = (w1, w2) − γ (∂L/∂w1, ∂L/∂w2),    (1)

L is the loss function (must be differentiable): in detail it is L(w1, w2, T), and we want to compute the gradient(s) at w1, w2.

γ is the learning rate (a scalar, typically known).



Back-propagation

Calculate intermediate values on all units:

1. a = w1 x̂n
2. b = φ(w1 x̂n)
3. c = w2 φ(w1 x̂n)
4. d = φ(w2 φ(w1 x̂n))
5. L(d) = L(φ(w2 φ(w1 x̂n)))

The partial derivatives are:

6. ∂L(d)/∂d = L'(d)
7. ∂d/∂c = φ'(w2 φ(w1 x̂n))
8. ∂c/∂b = w2
9. ∂b/∂a = φ'(w1 x̂n)




Calculating the Gradients I

Apply the chain rule:

    ∂L/∂w1 = (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a)(∂a/∂w1),

    ∂L(d)/∂w2 = (∂L(d)/∂d)(∂d/∂c)(∂c/∂w2).

Start the calculation from left to right.

We propagate the gradients (partial products) from the last layer towards the input.



Calculating the Gradients

And because we have N training pairs:

    ∂L/∂w1 = Σₙ₌₁ᴺ (∂L(dn)/∂dn)(∂dn/∂cn)(∂cn/∂bn)(∂bn/∂an)(∂an/∂w1),

    ∂L/∂w2 = Σₙ₌₁ᴺ (∂L(dn)/∂dn)(∂dn/∂cn)(∂cn/∂w2).

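Putting the last few slides together, a minimal sketch of back-propagation for this two-weight network; the squared-error loss L(d) = (d − ŷn)² and the toy data are our own choices for illustration.

```python
import numpy as np

def relu(x):  return np.maximum(0.0, x)
def drelu(x): return (x > 0).astype(float)   # phi'(x), the ReLU derivative

def gradients(w1, w2, x_hat, y_hat):
    """Back-propagation for y = phi(w2 * phi(w1 * x)) with squared-error loss."""
    dL_dw1 = dL_dw2 = 0.0
    for xn, yn in zip(x_hat, y_hat):
        # Forward pass: intermediate values a, b, c, d (slide items 1-4).
        a = w1 * xn
        b = relu(a)
        c = w2 * b
        d = relu(c)
        # Local partial derivatives (slide items 6-9), with L(d) = (d - yn)^2.
        dL_dd = 2.0 * (d - yn)     # L'(d)
        dd_dc = drelu(c)           # phi'(w2 phi(w1 xn))
        dc_db = w2
        db_da = drelu(a)           # phi'(w1 xn)
        # Chain rule, accumulated over the N training pairs.
        dL_dw1 += dL_dd * dd_dc * dc_db * db_da * xn   # since da/dw1 = xn
        dL_dw2 += dL_dd * dd_dc * b                    # since dc/dw2 = b
    return dL_dw1, dL_dw2

# One gradient-descent step (formula (1)) with learning rate gamma.
w1, w2, gamma = 0.5, -0.3, 0.01
x_hat = np.array([0.2, 1.0, -0.7])
y_hat = np.array([0.1, 0.8, 0.0])
g1, g2 = gradients(w1, w2, x_hat, y_hat)
w1, w2 = w1 - gamma * g1, w2 - gamma * g2
print(w1, w2)
```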


Thank you!

