Multi-layer Perceptron
Lecturer: Barnabas Poczos
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
Structure of MLP:
[Figure: structure of a multi-layer perceptron]
Here we will always assume that the activation function is differentiable. This will allow us to optimize the
cost function with gradient descent. However, non-differentiable activation functions are becoming popular as
well (for example, the ReLU, $f(s) = \max(0, s)$, which is not differentiable at $s = 0$).
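As a concrete instance, here is a minimal sketch of one differentiable activation and its derivative; the notes leave $f$ generic, so the sigmoid below is only an illustrative assumption (it is reused in the later sketches).

```python
import numpy as np

def f(s):
    """Logistic sigmoid: differentiable everywhere, so gradient descent applies."""
    return 1.0 / (1.0 + np.exp(-s))

def f_prime(s):
    """Sigmoid derivative: f'(s) = f(s) * (1 - f(s))."""
    fs = f(s)
    return fs * (1.0 - fs)
```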
More generally:
$$\epsilon^2 = \sum_{p=1}^{N_L} \epsilon_p^2 = \sum_{p=1}^{N_L} (\hat{y}_p - y_p)^2.$$
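For example, with $N_L = 2$ outputs and hypothetical values $\hat{y} = (0.8, 0.2)$ and target $y = (1, 0)$,

$$\epsilon^2 = (0.8 - 1)^2 + (0.2 - 0)^2 = 0.04 + 0.04 = 0.08.$$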
We want to calculate
$$\frac{\partial \epsilon^2(k)}{\partial W_{ij}^l(k)} = \,?$$
2.2 Notation
• $W_{ij}^l(k)$: At time step $k$, the strength of the connection from neuron $j$ on layer $l-1$ to neuron $i$ on layer $l$ ($i = 1, 2, \dots, N_l$, $j = 1, 2, \dots, N_{l-1}$).
• $s_i^l(k)$: The summed input of neuron $i$ on layer $l$ before applying the activation function $f$ at time step $k$ ($i = 1, \dots, N_l$).
$$x^l = \hat{y}^{l-1} \in \mathbb{R}^{N_{l-1}},$$

$$s_i^l = W_{i\cdot}^l\, \hat{y}^{l-1} = \sum_{j=1}^{N_{l-1}} W_{ij}^l x_j^l = \sum_{j=1}^{N_{l-1}} W_{ij}^l f(s_j^{l-1}), \qquad i = 1, 2, \dots, N_l,$$

and on the next layer

$$s_j^{l+1} = \sum_{i=1}^{N_l} W_{ji}^{l+1} f(s_i^l), \qquad j = 1, 2, \dots, N_{l+1}.$$
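A minimal NumPy sketch of this forward recursion; the sigmoid activation and the 3-4-2 layer sizes are illustrative assumptions, not fixed by the notes.

```python
import numpy as np

def f(s):
    # sigmoid activation (an assumption; the notes leave f generic)
    return 1.0 / (1.0 + np.exp(-s))

def forward(Ws, x):
    """Ws[l] is the N_l x N_{l-1} weight matrix W^l; x is the network input.
    Returns the summed inputs s^l of every layer and the final output yhat."""
    s_list = []
    yhat = x
    for W in Ws:
        s = W @ yhat          # s^l_i = sum_j W^l_ij x^l_j, with x^l = yhat^{l-1}
        yhat = f(s)           # yhat^l = f(s^l)
        s_list.append(s)
    return s_list, yhat

# usage with a hypothetical 3-4-2 architecture
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
s_list, yhat = forward(Ws, rng.normal(size=3))
```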
As a special case, we have that
$$\delta_i^L(k) = -\sum_{p=1}^{N_L} \frac{\partial \big(y_p(k) - f(s_p^L(k))\big)^2}{\partial s_i^L(k)} = 2\,\epsilon_i(k)\, f'(s_i^L(k)),$$

since only the $p = i$ term depends on $s_i^L(k)$.
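Elementwise, this output-layer delta is straightforward to compute; a sketch under the same sigmoid assumption as above:

```python
import numpy as np

def f(s):
    return 1.0 / (1.0 + np.exp(-s))

def f_prime(s):
    fs = f(s)
    return fs * (1.0 - fs)

def output_delta(y, s_L):
    """delta^L_i = 2 * eps_i * f'(s^L_i), with eps_i = y_i - f(s^L_i)."""
    eps = y - f(s_L)
    return 2.0 * eps * f_prime(s_L)
```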
For a hidden layer, the chain rule through the next layer gives

$$\delta_i^l(k) = -\sum_{p=1}^{N_L} \frac{\partial \epsilon_p^2}{\partial s_i^l} = -\sum_{p=1}^{N_L} \sum_{j=1}^{N_{l+1}} \frac{\partial \epsilon_p^2}{\partial s_j^{l+1}} \frac{\partial s_j^{l+1}}{\partial s_i^l} = -\sum_{j=1}^{N_{l+1}} \sum_{p=1}^{N_L} \frac{\partial \epsilon_p^2}{\partial s_j^{l+1}}\, W_{ji}^{l+1} f'(s_i^l),$$

where the last step uses $\partial s_j^{l+1} / \partial s_i^l = W_{ji}^{l+1} f'(s_i^l)$ from the forward equations above.
Therefore,
$$\delta_i^l(k) = \Big( \sum_{j=1}^{N_{l+1}} \delta_j^{l+1}(k)\, W_{ji}^{l+1}(k) \Big) f'(s_i^l(k)).$$
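In code, this recursion is a matrix-vector product followed by an elementwise scaling; a sketch (the sigmoid derivative is again an assumption):

```python
import numpy as np

def f_prime(s):
    fs = 1.0 / (1.0 + np.exp(-s))
    return fs * (1.0 - fs)

def hidden_delta(W_next, delta_next, s):
    """delta^l_i = (sum_j delta^{l+1}_j W^{l+1}_{ji}) * f'(s^l_i).
    W_next is the N_{l+1} x N_l matrix W^{l+1}, so W_next.T @ delta_next
    computes the inner sum for every i at once."""
    return (W_next.T @ delta_next) * f_prime(s)
```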
In vector form:
$$W_{i\cdot}^l(k+1) = W_{i\cdot}^l(k) + \mu\, \delta_i^l(k)\, x^l(k)^\top,$$

where $\mu > 0$ is the learning rate.
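Applied to every row $i$ at once, this is a rank-one update; a sketch with a hypothetical step size mu:

```python
import numpy as np

def update_weights(W, delta, x, mu=0.1):
    """W^l(k+1) = W^l(k) + mu * delta^l x^{l,T}: np.outer(delta, x)[i, j]
    equals delta^l_i * x^l_j, so row i moves by mu * delta^l_i * x^l."""
    return W + mu * np.outer(delta, x)
```

Chaining forward, output_delta, hidden_delta, and update_weights over the layers, as sketched above, gives one full backpropagation step.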