You are on page 1of 38

Deep Learning book, by Ian Goodfellow,

Yoshua Bengio and Aaron Courville


Chapter 6 :Deep Feedforward Networks

Benoit Massé Dionyssos Kounades-Bastian

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 1 / 25


Linear regression (and classication)
Input vector x
Output vector y
Parameters Weight W and bias b

Prediction : y = W> x + b

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 2 / 25


Linear regression (and classication)
Input vector x
Output vector y
Parameters Weight W and bias b

Prediction : y = W> x + b

W b
x u y

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 2 / 25


Linear regression (and classication)
Input vector x
Output vector y
Parameters Weight W and bias b

Prediction : y = W> x + b

x1 W11 . . . W23 b1
u1 y1
x2 b2
u2 y2
x3
Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 2 / 25
Linear regression (and classication)
Input vector x
Output vector y
Parameters Weight W and bias b

Prediction : y = W> x + b

x1 W, b
y1
x2
y2
x3
Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 2 / 25
Linear regression (and classication)

Advantages
Easy to use
Easy to train, low risk of overtting

Drawbacks
Some problems are inherently non-linear

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 3 / 25


Solving XOR

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 4 / 25


Solving XOR

Linear regressor

x1 W, b
y
x2
There is no value for W and b
such that ∀(x1 , x2 ) ∈ {0, 1}2
 
x1
W >
+ b = xor (x1 , x2 )
x2

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 4 / 25


Solving XOR

What about... ?
W, b
x1 u1 V, c
y
x2 u2

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 5 / 25


Solving XOR

What about... ?
W, b
x1 u1 V, c
y
x2 u2
Strictly equivalent :
The composition of two linear operation is still a linear
operation

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 5 / 25


Solving XOR
And about... ?
W, b φ
x1 u1 h1 V, c
y
x2 u2 h2
In which φ(x ) = max {0, x }

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 6 / 25


Solving XOR
And about... ?
W, b φ
x1 u1 h1 V, c
y
x2 u2 h2
In which φ(x ) = max {0, x }

It is possible !
1 1 0 1
     
With W = ,b= ,V= and c = 0,
1 1 −1 −2

Vφ(Wx + b) = xor (x1 , x2 )


Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 6 / 25
Neural network with one hidden layer

Compact representation
W , b, φ V, c
x h y

Neural network
Hidden layer with non-linearity
→ can represent broader class of function

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 7 / 25


Universal approximation theorem

Theorem
A neural network with one hidden layer can approximate any
continuous function

More formally, given a continuous function f : Cn 7→ Rm where


n
Cn is a compact subset of R ,

K
:x→ vi φ(wi > x + bi ) + c
X
ε
∀ε, ∃fNN
i =1
such that
ε
∀x ∈ Cn , ||f (x ) − fNN (x )|| < ε

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 8 / 25


Problems

Obtaining the network


The universal theorem gives no information about HOW to
obtain such a network
Size of the hidden layer h
Values of W and b

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 9 / 25


Problems

Obtaining the network


The universal theorem gives no information about HOW to
obtain such a network
Size of the hidden layer h
Values of W and b
Using the network
Even if we nd a way to obtain the network, the size of the
hidden layer may be prohibitively large.

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 9 / 25


Deep neural network
Why Deep ?
Let's stack l hidden layers one after the other; l is called the
length of the network.

W 1 , b1 , φ W 2 , b2 , φ W l , bl , φ V, c
x h1 ... hl y

Properties of DNN
The universal approximation theorem also apply
Some functions can be approximated by a DNN with N
hidden unit, and would require O(e N ) hidden units to be
represented by a shallow network.

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 10 / 25


Summary
Comparison
Linear classier
− Limited representational power

+ Simple

Shallow Neural network (Exactly one hidden layer)


+ Unlimited representational power

− Sometimes prohibitively wide

Deep Neural network


+ Unlimited representational power

+ Relatively small number of hidden units needed

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 11 / 25


Summary
Comparison
Linear classier
− Limited representational power

+ Simple

Shallow Neural network (Exactly one hidden layer)


+ Unlimited representational power

− Sometimes prohibitively wide

Deep Neural network


+ Unlimited representational power

+ Relatively small number of hidden units needed

Remaining problem
How to get this DNN ?

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 11 / 25


On the path of getting my own DNN

Hyperparameters
First, we need to dene the architecture of the DNN
The depth l
The size of the hidden layers n1 , . . . , nl
The activation function φ
The output unit

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 12 / 25


On the path of getting my own DNN

Hyperparameters
First, we need to dene the architecture of the DNN
The depth l
The size of the hidden layers n1 , . . . , nl
The activation function φ
The output unit

Parameters
When the architecture is dened, we need to train the DNN
W1 , b1 , . . . , Wl , bl

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 12 / 25


Hyperparameters

The depth l
The size of the hidden layers n1 , . . . , nl
Strongly depend on the problem to solve

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 13 / 25


Hyperparameters

The depth l
The size of the hidden layers n1 , . . . , nl
Strongly depend on the problem to solve

The activation function φ


g : x 7→ max { , xx } 1
σ : x 7→ ( + e )
ReLU 0

Sigmoid 1
− −

Many others : tanh, RBF, softplus...

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 13 / 25


Hyperparameters

The depth l
The size of the hidden layers n1 , . . . , nl
Strongly depend on the problem to solve

The activation function φ


g : x 7→ max { , xx } 1
σ : x 7→ ( + e )
ReLU 0

Sigmoid 1
− −

Many others : tanh, RBF, softplus...

The output unit


Linear outputE[y] = V> hl + c
For regression with Gaussian distribution y ∼ N (E[y], I)

Sigmoid output = σ(w> hl + ) b
For classication with Bernouilli distribution P (y = 1|x) = ŷ

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 13 / 25


Parameters Training

Objective
Let's dene θ = (W1 , b1 , . . . , Wl , bl ).
We suppose we have a set of inputs X = (x1 , . . . , xN ) and a
set of expected outputs Y = (y1 , . . . , yN ). The goal is to nd
a neural network fNN such that
∀i , fNN (x i , θ) ' yi .

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 14 / 25


Parameters Training

Cost function
To evaluate the error that our current network makes, let's
dene a cost function L(X, Y, θ). The goal becomes to nd
argmin L(X, Y, θ)
θ

Loss function
Should represent a combination of the distances between every
yi and the corresponding fNN (xi , θ)
Mean square error (rare)
Cross-entropy

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 15 / 25


Parameters Training

Find the minimum


The basic idea consists in computing θ̂ such that
∇θ L(X, Y, θ̂) = 0.

This is dicult to solve analytically e.g. when θ have millions


of degrees of freedom.

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 16 / 25


Parameters Training

Gradient descent
Let's use a numerical way to optimize θ, called the gradient
descent (section 4.3). The idea is that
f (θ − εu) ' f (θ) − εu> ∇f (θ)

So if we take u = ∇f (θ), we have u> u > 0 and then


f (θ − εu) ' f (θ) − εu> u < f (θ).

If f is a function to minimize, we have an update rule that


improves our estimate.

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 17 / 25


Parameters Training

Gradient descent algorithm


1
Have an estimate θ̂ of the parameters
2
Compute ∇θ L(X, Y, θ̂)
3
Update θ̂ ←− θ̂ − ε∇θ L
4
Repeat step 2-3 until ∇θ L < threshold

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 18 / 25


Parameters Training

Gradient descent algorithm


1
Have an estimate θ̂ of the parameters
2
Compute ∇θ L(X, Y, θ̂)
3
Update θ̂ ←− θ̂ − ε∇θ L
4
Repeat step 2-3 until ∇θ L < threshold

Problem
How to estimate eciently ∇θ L(X, Y, θ̂) ?
Back-propagation algorithm

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 18 / 25


Back-propagation for Parameter Learning

Consider the architecture:


w 2 w 1

y φ φ
x
with function:

y =φ w2 φ(w1 x ) ,
N
some training pairs T = x̂n , ŷn n=1 , and


an activation-function φ().
Learn w1 , w2 so that: Feeding x̂n results ŷn .

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 19 / 25


Prerequisite: dierentiable activation function

For learning to be possible φ() has to be dierentiable.


Let φ0 (x ) = ∂φ(x ) denote the derivative of φ(x ).
∂x
For example when φ(x ) = Relu(x ) we have:

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 20 / 25


Gradient-based Learning

Minimize the loss function L(w1 , w2 , T ).


We will learn the weights by iterating:
 
updated ∂L
 ∂ w1 
  
w1 w1
= −γ , (1)
w2 w2 ∂L
∂ w2

L is the loss function (must be dierentiable): In detail is


L(w1 , w2 , T ) and we want to compute the gradient(s) at
w1 , w2 .

γ is the learning rate (a scalar typically known).

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 21 / 25


Back-propagation

Calculate intermediate The partial derivatives are:


values on all units:
∂L(d )
∂d = L0 ( ).d
a = w1x̂n
6

1 .
∂d
= φ0 ( w2φ(w1x̂n ))
b = φ(w1x̂n ) ∂c
7 .
.
∂c
w2
2

c = w2φ(w1x̂n )
8
∂b = .

w1x̂n )
3 .
∂b
d = φ w2φ(w1x̂n ) ∂a = φ0 ( .
9

4 .

L(d ) = L φ w2 φ(w1 x̂n )



5 .

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 22 / 25


Calculating the Gradients I

Apply chain rule:


∂L ∂L ∂ d ∂ c ∂ b ∂ a
= ,
∂ w1 ∂ d ∂ c ∂ b ∂ a w1

∂L(d ) ∂L(d ) ∂ d ∂ c
= .
∂ w2 ∂ d ∂ c w2

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 23 / 25


Calculating the Gradients I

Apply chain rule:


∂L ∂L ∂ d ∂ c ∂ b ∂ a
= ,
∂ w1 ∂ d ∂ c ∂ b ∂ a w1

∂L(d ) ∂L(d ) ∂ d ∂ c
= .
∂ w2 ∂ d ∂ c w2

Start the calculation from left-to-right.


We propage the gradients (partial products) from the last
layer towards the input.

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 23 / 25


Calculating the Gradients

And because we have N training pairs:


N
∂L X ∂L(dn ) ∂ dn ∂ cn ∂ bn ∂ an
= ,
∂ w1 n=1 ∂ dn ∂ cn ∂ bn ∂ an w1

N
∂L X ∂L(dn ) ∂ dn ∂ cn
= .
∂ w2 n=1 ∂ dn ∂ cn w2

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 24 / 25


Thank you !

Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 25 / 25