Linear regression (and classification)

Input vector $x$
Output vector $y$
Parameters: weight matrix $W$ and bias $b$

Prediction: $y = W^\top x + b$

[Diagram: inputs $x_1, x_2, x_3$ mapped to outputs $y_1, y_2$ through the weights $W_{11}, \ldots, W_{23}$ and biases $b_1, b_2$]
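As a minimal sketch of this prediction in NumPy (shapes chosen to match the three-input, two-output diagram above; all values illustrative):

```python
import numpy as np

# Linear prediction y = W^T x + b for 3 inputs and 2 outputs.
W = np.random.randn(3, 2)        # weight matrix, one column per output
b = np.zeros(2)                  # bias vector
x = np.array([0.5, -1.0, 2.0])   # input vector

y = W.T @ x + b                  # prediction, shape (2,)
print(y)
```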
Linear regression (and classification)

Advantages
Easy to use
Easy to train, low risk of overfitting

Drawbacks
Some problems are inherently non-linear

[Diagram: a linear regressor mapping $x_1, x_2$ to $y$ with parameters $W, b$]

There is no value of $W$ and $b$ such that
$$\forall (x_1, x_2) \in \{0, 1\}^2, \quad W^\top \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + b = \operatorname{xor}(x_1, x_2)$$
What about... ?

[Diagram: $x_1, x_2$ mapped to a hidden layer $u_1, u_2$ with parameters $W, b$, then to $y$ with parameters $V, c$]
Strictly equivalent: the composition of two linear operations is still a linear operation, so this two-layer linear network is no more expressive than a single linear regressor.
It is possible!

With $W = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, $b = \begin{pmatrix} 0 \\ -1 \end{pmatrix}$, $V = \begin{pmatrix} 1 \\ -2 \end{pmatrix}$ and $c = 0$, the network $y = V^\top \varphi(W^\top x + b) + c$ computes $\operatorname{xor}(x_1, x_2)$ when $\varphi$ is a non-linearity such as the ReLU $\varphi(z) = \max(0, z)$.
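A quick NumPy check of these values, taking $\varphi$ to be the ReLU (an assumption; the slides leave $\varphi$ abstract, but these particular weights realize xor with $\max(0, z)$):

```python
import numpy as np

W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, -1.0])
V = np.array([1.0, -2.0])
c = 0.0

def relu(z):
    return np.maximum(0.0, z)

# Evaluate the two-layer network on all four binary inputs.
for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2], dtype=float)
        u = relu(W.T @ x + b)     # hidden layer u = phi(W^T x + b)
        y = V @ u + c             # output y = V^T u + c
        print((x1, x2), "->", y)  # prints 0, 1, 1, 0 = xor(x1, x2)
```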
Compact representation

[Diagram: $x \to h \to y$ with parameters $W, b$ and activation $\varphi$ on the hidden layer, then $V, c$ on the output]

Neural network
A hidden layer with a non-linearity → can represent a broader class of functions
Theorem
A neural network with one hidden layer can approximate any continuous function:
$$\forall \varepsilon > 0,\ \exists f_{NN}^{\varepsilon} : x \mapsto \sum_{i=1}^{K} v_i\, \varphi(w_i^\top x + b_i) + c$$
such that
$$\forall x \in C_n,\ \|f(x) - f_{NN}^{\varepsilon}(x)\| < \varepsilon$$
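The functional form in the theorem is easy to write down directly. A minimal sketch (function and parameter names are mine, with $\varphi = \tanh$ as an illustrative activation):

```python
import numpy as np

def f_nn(x, w, b, v, c, phi=np.tanh):
    """One-hidden-layer network: sum_i v_i * phi(w_i . x + b_i) + c.

    w: (K, n) hidden weights, b: (K,) hidden biases,
    v: (K,) output weights, c: scalar output bias.
    """
    return v @ phi(w @ x + b) + c

# K = 50 random hidden units on a 2-dimensional input.
rng = np.random.default_rng(0)
K, n = 50, 2
w, b, v = rng.normal(size=(K, n)), rng.normal(size=K), rng.normal(size=K)
print(f_nn(np.array([0.3, -0.7]), w, b, v, c=0.0))
```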
[Diagram: a deep network $x \to h_1 \to \cdots \to h_l \to y$ with parameters $(W_1, b_1, \varphi), \ldots, (W_l, b_l, \varphi)$ and output parameters $V, c$]
Properties of DNN
The universal approximation theorem also applies.
Some functions can be approximated by a DNN with $N$ hidden units, but would require $O(e^N)$ hidden units to be represented by a shallow network.
+ Simple
Remaining problem
How to get this DNN?
Hyperparameters
First, we need to define the architecture of the DNN:
The depth $l$
The size of the hidden layers $n_1, \ldots, n_l$
The activation function $\varphi$
The output unit
Parameters
When the architecture is defined, we need to train the DNN to find the parameters $W_1, b_1, \ldots, W_l, b_l$.
The depth $l$ and the size of the hidden layers $n_1, \ldots, n_l$ strongly depend on the problem to solve.

The activation function $\varphi$
[Plot: the sigmoid activation $\sigma(z) = \frac{1}{1 + e^{-z}}$, saturating at 0 and 1]
Objective
Let's define $\theta = (W_1, b_1, \ldots, W_l, b_l)$.
We suppose we have a set of inputs $X = (x_1, \ldots, x_N)$ and a set of expected outputs $Y = (y_1, \ldots, y_N)$. The goal is to find a neural network $f_{NN}$ such that
$$\forall i,\ f_{NN}(x_i, \theta) \simeq y_i.$$

Cost function
To evaluate the error that our current network makes, let's define a cost function $L(X, Y, \theta)$. The goal becomes to find
$$\operatorname*{argmin}_{\theta}\ L(X, Y, \theta)$$
Loss function
Should represent a combination of the distances between every $y_i$ and the corresponding $f_{NN}(x_i, \theta)$:
Mean squared error (rare)
Cross-entropy
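A minimal sketch of both losses (NumPy; the function names and the one-hot/probability conventions are my assumptions):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average squared distance to the targets.
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Cross-entropy for one-hot targets and predicted class probabilities.
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))

y_true = np.array([[0.0, 1.0], [1.0, 0.0]])
y_pred = np.array([[0.2, 0.8], [0.9, 0.1]])
print(mse(y_true, y_pred), cross_entropy(y_true, y_pred))
```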
Gradient descent
Let's use a numerical way to optimize $\theta$, called gradient descent (section 4.3). The idea is that
$$f(\theta - \varepsilon u) \simeq f(\theta) - \varepsilon\, u^\top \nabla f(\theta),$$
so taking $u = \nabla f(\theta)$ and moving $\theta$ a small step in the direction $-\nabla f(\theta)$ decreases $f$.
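As a sketch, the resulting update rule $\theta \leftarrow \theta - \varepsilon \nabla f(\theta)$ on a toy quadratic (all names illustrative):

```python
import numpy as np

def gradient_descent(grad_f, theta0, lr=0.1, n_steps=100):
    """Repeatedly apply theta <- theta - lr * grad_f(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * grad_f(theta)
    return theta

# f(theta) = ||theta||^2 has gradient 2 * theta; the minimum is at 0.
print(gradient_descent(lambda t: 2 * t, theta0=[3.0, -2.0]))  # ~[0, 0]
```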
Problem
How to efficiently estimate $\nabla_\theta L(X, Y, \hat{\theta})$?
Back-propagation algorithm

[Diagram: the chain $x \to \varphi \to \varphi \to y$ with weights $w_1$ and $w_2$]

Given:
the function $y = \varphi\big(w_2\, \varphi(w_1 x)\big)$,
some training pairs $T = \{(\hat{x}_n, \hat{y}_n)\}_{n=1}^{N}$, and
an activation function $\varphi(\cdot)$,
learn $w_1, w_2$ so that feeding $\hat{x}_n$ yields $\hat{y}_n$.
Intermediate quantities of the forward pass:
$a = w_1 \hat{x}_n$
$b = \varphi(w_1 \hat{x}_n)$
$c = w_2\, \varphi(w_1 \hat{x}_n)$
$d = \varphi\big(w_2\, \varphi(w_1 \hat{x}_n)\big)$

Local derivatives:
$\dfrac{\partial d}{\partial c} = \varphi'\big(w_2\, \varphi(w_1 \hat{x}_n)\big)$
$\dfrac{\partial c}{\partial b} = w_2$
$\dfrac{\partial b}{\partial a} = \varphi'(w_1 \hat{x}_n)$

Chain rule, for one training pair:
$$\frac{\partial L(d)}{\partial w_2} = \frac{\partial L(d)}{\partial d}\, \frac{\partial d}{\partial c}\, \frac{\partial c}{\partial w_2}.$$

Summed over the training set:
$$\frac{\partial L}{\partial w_2} = \sum_{n=1}^{N} \frac{\partial L(d_n)}{\partial d_n}\, \frac{\partial d_n}{\partial c_n}\, \frac{\partial c_n}{\partial w_2}.$$
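A minimal NumPy sketch of one back-propagation step for this scalar example, assuming $\varphi = \tanh$ and a squared-error loss $L(d) = (d - \hat{y}_n)^2$ (both are my illustrative choices; the slides leave them abstract):

```python
import numpy as np

phi = np.tanh
dphi = lambda z: 1.0 - np.tanh(z) ** 2   # derivative of tanh

def backprop_step(w1, w2, x_hat, y_hat, lr=0.1):
    # Forward pass: the intermediate quantities a, b, c, d above.
    a = w1 * x_hat
    b = phi(a)
    c = w2 * b
    d = phi(c)
    # Squared-error loss L(d) = (d - y_hat)^2, so dL/dd = 2 (d - y_hat).
    dL_dd = 2.0 * (d - y_hat)
    # Chain rule, exactly as on the slide:
    dL_dw2 = dL_dd * dphi(c) * b                     # dL/dd * dd/dc * dc/dw2
    dL_dw1 = dL_dd * dphi(c) * w2 * dphi(a) * x_hat  # ... * dc/db * db/da * da/dw1
    return w1 - lr * dL_dw1, w2 - lr * dL_dw2

# Fit a single training pair (x_hat, y_hat) = (1.0, 0.5).
w1, w2 = 0.5, -0.3
for _ in range(200):
    w1, w2 = backprop_step(w1, w2, x_hat=1.0, y_hat=0.5)
print(np.tanh(w2 * np.tanh(w1 * 1.0)))  # output approaches 0.5
```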