
# Deep Learning Book, by Ian Goodfellow,

## Yoshua Bengio and Aaron Courville

Chapter 6: Deep Feedforward Networks

## Benoit Massé, Dionyssos Kounades-Bastian

Linear regression (and classification)
Input vector x
Output vector y
Parameters: weight W and bias b

Prediction: y = Wᵀx + b


Linear regression (and classication)
Input vector x
Output vector y
Parameters Weight W and bias b

Prediction : y = W> x + b

x1 W11 . . . W23 b1
u1 y1
x2 b2
u2 y2
x3
Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward Networks 2 / 25
(Compact diagram: the same mapping drawn with a single arrow labelled W, b from inputs x1, x2, x3 to outputs y1, y2.)
Linear regression (and classification)

Advantages
Easy to use
Easy to train, low risk of overfitting

Drawbacks
Some problems are inherently non-linear

Solving XOR

Linear regressor

(Diagram: inputs x1, x2 mapped through W, b to output y.)

There is no value of W and b
such that ∀(x1, x2) ∈ {0, 1}²,
Wᵀ(x1, x2)ᵀ + b = xor(x1, x2)

Solving XOR

What about... ?

(Diagram: inputs x1, x2 mapped through W, b to hidden units u1, u2, then through V, c to output y.)

Strictly equivalent:
The composition of two linear operations is still a linear
operation.

Solving XOR

And about... ?

(Diagram: inputs x1, x2 mapped through W, b and non-linearity φ to hidden units h1, h2, then through V, c to output y.)

in which φ(x) = max{0, x}

It is possible!

With W = [1 1; 1 1], b = (0, −1)ᵀ, V = (1, −2) and c = 0,

Vφ(Wx + b) = xor(x1, x2)
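This solution can be checked numerically. A minimal NumPy sketch (the array shapes and the `f` helper are my own conventions, not from the slides; W is symmetric here, so the slide's Wᵀ convention makes no difference):

```python
import numpy as np

# Parameters from the slide: Vφ(Wx + b) = xor(x1, x2), with φ = ReLU
W = np.array([[1, 1],
              [1, 1]])
b = np.array([0, -1])
V = np.array([1, -2])
c = 0

def relu(x):
    return np.maximum(0, x)

def f(x1, x2):
    x = np.array([x1, x2])
    return int(V @ relu(W @ x + b) + c)

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", f(x1, x2))  # 0, 1, 1, 0
```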
Neural network with one hidden layer

Compact representation

(Diagram: x mapped through W, b, φ to h, then through V, c to y.)

Neural network
Hidden layer with non-linearity
→ can represent a broader class of functions

Universal approximation theorem

Theorem
A neural network with one hidden layer can approximate any
continuous function.

More formally, given a continuous function f : Cn → Rm, where
Cn is a compact subset of Rn,

∀ε, ∃ fεNN : x ↦ Σi=1..K vi φ(wiᵀx + bi) + c

such that

∀x ∈ Cn, ‖f(x) − fεNN(x)‖ < ε
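The flavour of the theorem can be illustrated numerically (this is not a proof): fix random hidden weights wi, bi, and fit only the output weights vi and c by least squares so the one-hidden-layer sum approximates sin on a compact interval. All sizes and distributions below are arbitrary illustrative choices.

```python
import numpy as np

# f_NN(x) = Σ v_i φ(w_i x + b_i) + c, with φ = ReLU and random w_i, b_i
rng = np.random.default_rng(0)
K = 200                                   # number of hidden units (arbitrary)
w = rng.normal(scale=5.0, size=K)
b = rng.normal(scale=5.0, size=K)

def features(x):
    """Hidden activations φ(w_i x + b_i), plus a constant column for c."""
    H = np.maximum(0.0, np.outer(x, w) + b)
    return np.hstack([H, np.ones((len(x), 1))])

x = np.linspace(0.0, 3.0, 300)            # a compact subset of R
target = np.sin(x)
coef, *_ = np.linalg.lstsq(features(x), target, rcond=None)  # fit v and c

max_err = np.max(np.abs(features(x) @ coef - target))
print(max_err)   # small on this grid
```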


Problems

## Obtaining the network

The universal approximation theorem gives no information about
HOW to obtain such a network:
The size of the hidden layer h
The values of W and b

Using the network
Even if we find a way to obtain the network, the size of the
hidden layer may be prohibitively large.

Deep neural network
Why Deep ?
Let's stack l hidden layers one after the other; l is called the
depth of the network.

(Diagram: x mapped through W1, b1, φ to h1, then through W2, b2, φ, . . . , Wl, bl, φ to hl, then through V, c to y.)

Properties of DNN
The universal approximation theorem also applies
Some functions can be approximated by a DNN with N
hidden units, but would require O(e^N) hidden units to be
represented by a shallow network.
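The stacked architecture above can be sketched as a short forward pass. The layer sizes, the use of ReLU for φ, and the random initialization are my own illustrative choices:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def deep_forward(x, params, V, c):
    """Forward pass: h_i = φ(W_i h_{i-1} + b_i), then linear output y = V h_l + c."""
    h = x
    for W, b in params:
        h = relu(W @ h + b)
    return V @ h + c

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]             # input 3, two hidden layers of 4, output 2
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-2], sizes[1:-1])]
V = rng.normal(size=(sizes[-1], sizes[-2]))
c = rng.normal(size=sizes[-1])

y = deep_forward(rng.normal(size=3), params, V, c)
print(y.shape)   # (2,)
```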


Summary
Comparison
Linear classifier
− Limited representational power
+ Simple

## Shallow Neural network (Exactly one hidden layer)

+ Unlimited representational power

## Deep Neural network

+ Unlimited representational power
+ Relatively small number of hidden units needed

Remaining problem
How to get this DNN ?


On the path of getting my own DNN

Hyperparameters
First, we need to define the architecture of the DNN:
The depth l
The size of the hidden layers n1, . . . , nl
The activation function φ
The output unit

Parameters
Once the architecture is defined, we need to train the DNN:
W1, b1, . . . , Wl, bl



Hyperparameters

The depth l
The size of the hidden layers n1, . . . , nl
Strongly depend on the problem to solve

## The activation function φ

ReLU: g : x ↦ max{0, x}
Sigmoid: σ : x ↦ 1/(1 + e⁻ˣ)

## The output unit

Linear output: E[y] = Vᵀhl + c
For regression with Gaussian distribution y ∼ N(E[y], I)

Sigmoid output: ŷ = σ(wᵀhl + b)
For classification with Bernoulli distribution P(y = 1|x) = ŷ
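The two activation functions above are one-liners in code; a quick sketch (function names are mine):

```python
import numpy as np

def g(x):                      # ReLU: g(x) = max{0, x}
    return np.maximum(0, x)

def sigma(x):                  # Sigmoid: σ(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

print(g(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 3.]
print(sigma(0.0))                      # 0.5
```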

Parameters Training

Objective
Let's define θ = (W1, b1, . . . , Wl, bl).
We suppose we have a set of inputs X = (x1, . . . , xN) and a
set of expected outputs Y = (y1, . . . , yN). The goal is to find
a neural network fNN such that
∀i, fNN(xi, θ) ≈ yi.

Parameters Training

Cost function
To evaluate the error that our current network makes, let's
define a cost function L(X, Y, θ). The goal becomes to find
argminθ L(X, Y, θ)

Loss function
Should represent a combination of the distances between every
yi and the corresponding fNN(xi, θ):
Mean square error (rare)
Cross-entropy
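The two losses can be sketched on a toy binary problem (the helper names and the example targets are mine, not from the slides):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    # Binary cross-entropy for targets y ∈ {0, 1}, predictions ŷ ∈ (0, 1)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.1, 0.8])
print(mse(y, y_hat))            # ≈ 0.02
print(cross_entropy(y, y_hat))  # ≈ 0.145
```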

Parameters Training

## Find the minimum

The basic idea consists in computing θ̂ such that
∇θ L(X, Y, θ̂) = 0.

This is difficult to solve analytically, e.g. when θ has millions
of degrees of freedom.

Parameters Training

Gradient descent
Let's use a numerical way to optimize θ, called gradient
descent (section 4.3). The idea is that
f(θ − εu) ≈ f(θ) − εuᵀ∇f(θ)

So if we take u = ∇f(θ), we have uᵀu > 0 and then
f(θ − εu) ≈ f(θ) − εuᵀu < f(θ).

If f is a function to minimize, we have an update rule that
improves our estimate.


Parameters Training

## Gradient descent algorithm

1. Have an estimate θ̂ of the parameters
2. Compute ∇θ L(X, Y, θ̂)
3. Update θ̂ ← θ̂ − ε∇θ L
4. Repeat steps 2-3 until ‖∇θ L‖ < threshold

Problem
How to estimate ∇θ L(X, Y, θ̂) efficiently?
→ Back-propagation algorithm
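The four-step loop above can be sketched on a toy differentiable loss where the gradient is known in closed form. The loss L(θ) = ‖Aθ − y‖², the matrix A, the learning rate ε and the threshold are all illustrative choices of mine:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
y = np.array([4.0, 3.0])

def grad(theta):                     # ∇θ L = 2 Aᵀ(Aθ − y)
    return 2.0 * A.T @ (A @ theta - y)

theta = np.zeros(2)                  # step 1: an estimate θ̂
eps = 0.05                           # learning rate ε
while np.linalg.norm(grad(theta)) > 1e-6:   # step 4: stop at small gradient
    theta = theta - eps * grad(theta)       # steps 2-3: compute and update

print(theta)   # ≈ [2., 3.], the exact minimizer of L
```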

Back-propagation for Parameter Learning

## Consider the architecture:

(Diagram: x mapped through w1 and φ, then through w2 and φ, to y)

with function:

y = φ(w2 φ(w1 x)),

some training pairs T = {(x̂n, ŷn)}, n = 1, . . . , N, and
an activation function φ().
Learn w1, w2 so that feeding x̂n yields ŷn.

Prerequisite: differentiable activation function

## For learning to be possible φ() has to be differentiable.

Let φ′(x) = ∂φ(x)/∂x denote the derivative of φ(x).
For example, when φ(x) = ReLU(x) we have:
φ′(x) = 0 for x < 0 and φ′(x) = 1 for x > 0.
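As a sketch, here are closed-form derivatives for the two activations seen earlier, with the sigmoid one checked against a central finite difference (the helper names are mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):           # for the sigmoid, φ'(x) = φ(x)(1 − φ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_prime(x):              # for the ReLU, φ'(x) = 1 if x > 0 else 0
    return 1.0 if x > 0 else 0.0

# Finite-difference check of φ' at a few points
h = 1e-6
max_err = 0.0
for x in (-1.0, 0.5, 2.0):
    num = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
    max_err = max(max_err, abs(num - sigmoid_prime(x)))
print(max_err)   # tiny: the analytic derivative matches
```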

Gradient-based Learning

## Minimize the loss function L(w1, w2, T).

We will learn the weights by iterating:

(w1, w2)ᵀ ← (w1, w2)ᵀ − γ (∂L/∂w1, ∂L/∂w2)ᵀ,   (1)

L is the loss function (must be differentiable): in detail it is
L(w1, w2, T), and we want to compute the gradient(s) at
w1, w2.

Back-propagation

## Calculate intermediate values on all units:

a = w1 x̂n
b = φ(w1 x̂n) = φ(a)
c = w2 φ(w1 x̂n) = w2 b
d = φ(w2 φ(w1 x̂n)) = φ(c)

The partial derivatives are:

∂L(d)/∂d = L′(d)
∂d/∂c = φ′(w2 φ(w1 x̂n))
∂c/∂b = w2
∂b/∂a = φ′(w1 x̂n)
∂a/∂w1 = x̂n
∂c/∂w2 = φ(w1 x̂n)


Calculating the Gradients I

## Apply chain rule:

∂L/∂w1 = (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a)(∂a/∂w1),

∂L(d)/∂w2 = (∂L(d)/∂d)(∂d/∂c)(∂c/∂w2).

## Start the calculation from left to right.

We propagate the gradients (partial products) from the last
layer towards the input.

Calculating the Gradients

## And because we have N training pairs:

∂L/∂w1 = Σn=1..N (∂L(dn)/∂dn)(∂dn/∂cn)(∂cn/∂bn)(∂bn/∂an)(∂an/∂w1),

∂L/∂w2 = Σn=1..N (∂L(dn)/∂dn)(∂dn/∂cn)(∂cn/∂w2).
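The whole back-propagation derivation can be sketched for a single training pair of the two-weight network y = φ(w2 φ(w1 x)). I take φ = sigmoid and a squared-error loss L(d) = (d − ŷ)², so L′(d) = 2(d − ŷ); both choices, and all names below, are assumptions for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grads(w1, w2, x_hat, y_hat):
    # Forward pass: store the intermediate values a, b, c, d
    a = w1 * x_hat
    b = sigmoid(a)
    c = w2 * b
    d = sigmoid(c)
    # Backward pass: accumulate the partial products from the output
    dL_dd = 2.0 * (d - y_hat)       # L'(d)
    dd_dc = d * (1.0 - d)           # φ'(c)
    db_da = b * (1.0 - b)           # φ'(a)
    dL_dw2 = dL_dd * dd_dc * b                    # ∂c/∂w2 = b
    dL_dw1 = dL_dd * dd_dc * w2 * db_da * x_hat   # ∂a/∂w1 = x̂
    return dL_dw1, dL_dw2

# Check the analytic gradients against central finite differences
w1, w2, x_hat, y_hat = 0.5, -0.3, 1.2, 1.0
L = lambda u1, u2: (sigmoid(u2 * sigmoid(u1 * x_hat)) - y_hat) ** 2
g1, g2 = grads(w1, w2, x_hat, y_hat)
h = 1e-6
err1 = abs((L(w1 + h, w2) - L(w1 - h, w2)) / (2 * h) - g1)
err2 = abs((L(w1, w2 + h) - L(w1, w2 - h)) / (2 * h) - g2)
print(err1, err2)   # both tiny
```

For N training pairs, the per-pair gradients returned by `grads` would simply be summed, as in the equations above.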

Thank you !