Philip O. Ogunbona
1 Introduction
2 Models of a Neuron
4 Network Architecture
5 Learning Process
6 Perceptron
8 References
A neural network is a machine designed to model the way in which the brain performs a
particular task or function of interest.
u_k = Σ_{j=1}^{m} ω_{kj} x_j    (1)

y_k = ϕ(u_k + b_k)    (2)
where
x_1, x_2, . . . , x_m are the input signals;
ω_{k1}, ω_{k2}, . . . , ω_{km} are the respective synaptic weights of neuron k;
u_k is the linear combiner output due to the input signals;
b_k is the bias;
ϕ(·) is the activation function.
The bias b_k applies an affine transformation to the output u_k of the linear combiner:

v_k = u_k + b_k    (3)

and

y_k = ϕ(v_k)    (5)

The bias can be absorbed into the neuron by adding a fixed input

x_0 = +1    (6)

and weight

ω_{k0} = b_k    (7)

Figure 2: Model of neuron with the bias absorbed into the neuron (Haykin 2009).

See Figure (2).
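As a concrete illustration, the model of Equations (1)-(3) and the absorbed-bias form of Equations (6)-(7) can be sketched in a few lines of Python. This is a minimal sketch; the logistic activation is chosen only as an example of ϕ, and the input/weight values are made up:

```python
import math

def neuron_output(x, w, b):
    """y_k = phi(sum_j w_kj * x_j + b_k), Eqs. (1)-(3), logistic phi."""
    v = sum(wj * xj for wj, xj in zip(w, x)) + b  # induced local field v_k
    return 1.0 / (1.0 + math.exp(-v))             # phi(v_k)

def neuron_output_absorbed(x, w):
    """Same neuron with the bias absorbed: x_0 = +1, w_0 = b_k, Eqs. (6)-(7)."""
    v = sum(wj * xj for wj, xj in zip(w, [1.0] + list(x)))
    return 1.0 / (1.0 + math.exp(-v))

x = [0.5, -1.0, 2.0]
w = [0.1, 0.4, -0.2]
b = 0.3
# both formulations give the same output
assert abs(neuron_output(x, w, b) - neuron_output_absorbed(x, [b] + w)) < 1e-12
```

The equality check confirms that prepending x_0 = +1 with weight ω_{k0} = b_k reproduces the biased neuron exactly.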
Choosing the right activation function is both science and art. For further insight, see the works of Ramachandran et al. (2017) and Mhaskar & Micchelli (1994). Together with the right cost function, activation functions make training neural networks possible.
v_1 = Σ_{j=0}^{3} ω_{1j} x_j
Figure 14: Recurrent Neural Network with Hidden Layer (Haykin (2009))
Types of Learning
Recall that in general we want to learn a mapping from input vector x to some output y
through a vector of weights ω
y = f (ω, x) (17)
such that the error (or loss or cost function) incurred in the prediction of the actual
value is minimized.
Figure 15: Illustration of the gradient descent algorithm (Goodfellow et al. 2016, p.80)
Let x = {x_1, x_2, . . . , x_m};

∇f(x) = [∂f/∂x_1, ∂f/∂x_2, . . . , ∂f/∂x_m]^t

The partial derivative ∂f/∂x_i measures how f changes as only the variable x_i increases at point x.

(∂/∂α) f(x + αu) = u^t ∇f(x) = ||u||_2 ||∇f(x)||_2 cos θ
Minimize f by finding the direction in which f decreases fastest; do this by minimizing the directional derivative.
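The steepest-descent idea above can be sketched numerically. This is a minimal example on a hypothetical quadratic; the function, starting point, and step size are illustrative, not from the text:

```python
def f(x):
    # an illustrative objective: f(x1, x2) = x1^2 + 3*x2^2
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    # its gradient
    return [2.0 * x[0], 6.0 * x[1]]

# gradient descent: step opposite the gradient, the direction of steepest decrease
x = [2.0, -1.0]
eta = 0.1  # step size (learning rate)
for _ in range(200):
    g = grad_f(x)
    x = [xi - eta * gi for xi, gi in zip(x, g)]

assert f(x) < 1e-6  # iterates approach the minimizer (0, 0)
```

Each step moves along −∇f(x), which by the cos θ factor in the directional derivative is the unit direction of fastest decrease.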
Figure 19: (a) Linearly separable patterns; (b) Linearly non-separable patterns Haykin (2009)
Network contains one or more layers that are hidden from both the input
and output nodes
Training method
The multilayer perceptron is usually trained using the back-propagation algorithm:
Forward phase: weights of the network are fixed; the input signal is propagated layer by layer through the network and the transformed signal appears at the output
Backward phase: an error signal is computed by comparing the generated output with the desired response; the error signal is propagated backward, layer by layer, through the network, and successive adjustments are made to the weights of the network
This is equivalent to modifying the weights so that the network minimizes the error between the desired output and the response of the network.
The gradient descent algorithm can be used to find the minimum of an objective function by iteratively computing adjustments that reduce the objective function.
From Equation (20) the adjustment is proportional to the gradient of the objective function; in this case ∇E (E is the error signal energy) with respect to the parameters ω.
E_j(n) = (1/2) e_j²(n)    (22)

Total instantaneous error (summed over all neurons in the output layer) is

E(n) = Σ_{j∈C} E_j(n) = (1/2) Σ_{j∈C} e_j²(n)    (23)

Computation of the error could be in batch mode or on-line mode, leading to either batch-mode training (presentation of all training samples) or on-line (stochastic) training (presentation of training samples one at a time)
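Equations (22)-(23) can be checked with a short sketch; the desired responses d_j(n) and outputs y_j(n) below are made-up values for illustration:

```python
def error_energy(d, y):
    """Total instantaneous error energy E(n) over output neurons, Eqs. (22)-(23)."""
    e = [dj - yj for dj, yj in zip(d, y)]      # error signals e_j(n) = d_j(n) - y_j(n)
    return 0.5 * sum(ej ** 2 for ej in e)

# one training sample: two output neurons
assert abs(error_energy([1.0, 0.0], [0.8, 0.3]) - 0.065) < 1e-12
```

In batch mode this quantity would be averaged over all N training samples before a weight update; in on-line mode each sample's E(n) drives an update on its own.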
This is proportional to the partial derivative ∂E(n)/∂ω_{ji}(n) and determines the direction of search in the weight space for ω_{ji}

The chain rule tells us how to compute ∂E(n)/∂ω_{ji}(n) from a set of known quantities:

∂E(n)/∂ω_{ji}(n) = [∂E(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] [∂v_j(n)/∂ω_{ji}(n)]    (26)

∂E(n)/∂e_j(n) = e_j(n)    (27)
∂e_j(n)/∂y_j(n) = −1    (28)

∂y_j(n)/∂v_j(n) = ϕ′_j(v_j(n)); where (·)′ indicates differentiation    (29)

Recall Equation (24): v_j(n) = Σ_{i=0}^{m} ω_{ji}(n) y_i(n)

∂v_j(n)/∂ω_{ji}(n) = y_i(n)    (30)

Equation (26) becomes (using Equations (27) - (30))

∂E(n)/∂ω_{ji}(n) = −e_j(n) ϕ′_j(v_j(n)) y_i(n)    (31)
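Equation (31) can be verified numerically for a single output neuron. In this sketch the logistic function is an illustrative choice of ϕ, and the inputs, weights, and desired response are made up; the analytic gradient is compared with central differences:

```python
import math

def phi(v):           # logistic activation (an illustrative choice)
    return 1.0 / (1.0 + math.exp(-v))

def phi_prime(v):     # its derivative: phi(v) * (1 - phi(v))
    s = phi(v)
    return s * (1.0 - s)

y_in = [1.0, 0.5, -0.3]   # inputs y_i(n); y_0 = +1 carries the bias
w = [0.2, -0.1, 0.4]      # weights w_ji(n)
d = 1.0                   # desired response d_j(n)

def E(weights):
    """Instantaneous error energy E(n) = 0.5 * e_j^2(n) for this neuron."""
    v = sum(wi * yi for wi, yi in zip(weights, y_in))
    e = d - phi(v)
    return 0.5 * e * e

# analytic gradient, Eq. (31): dE/dw_ji = -e_j(n) phi'_j(v_j(n)) y_i(n)
v = sum(wi * yi for wi, yi in zip(w, y_in))
e = d - phi(v)
analytic = [-e * phi_prime(v) * yi for yi in y_in]

# numerical check by central differences
h = 1e-6
max_diff = 0.0
for i in range(len(w)):
    wp = list(w); wp[i] += h
    wm = list(w); wm[i] -= h
    numeric = (E(wp) - E(wm)) / (2 * h)
    max_diff = max(max_diff, abs(numeric - analytic[i]))
assert max_diff < 1e-8
```

The agreement confirms each factor in the chain of Equations (27)-(30).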
The correction ∆ω_{ji}(n) applied to ω_{ji}(n) is defined by the delta rule:

∆ω_{ji}(n) = −η ∂E(n)/∂ω_{ji}(n) = η e_j(n) ϕ′_j(v_j(n)) y_i(n);  η is the learning rate parameter

where δ_j(n) = e_j(n) ϕ′_j(v_j(n)) is defined as the local gradient for neuron j.
The local gradient for neuron j is the product of the corresponding error e_j(n) and the derivative of the associated activation function, ϕ′_j(v_j(n)).
The error e_j(n) is easily computed for the output neurons; we have access to d_j(n) and y_j(n). How do we compute the error for hidden neurons? These have no given d_j(n).
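The delta rule above can be sketched for one output neuron. A logistic activation is assumed for ϕ, and the single training sample and learning rate are illustrative:

```python
import math

def phi(v):
    return 1.0 / (1.0 + math.exp(-v))

def delta_rule_update(w, y_in, d, eta):
    """One delta-rule step for an output neuron: w_ji += eta * delta_j * y_i."""
    v = sum(wi * yi for wi, yi in zip(w, y_in))   # induced local field v_j(n)
    y = phi(v)                                    # neuron output y_j(n)
    e = d - y                                     # error signal e_j(n)
    delta = e * y * (1.0 - y)                     # local gradient: e_j * phi'_j(v_j)
    return [wi + eta * delta * yi for wi, yi in zip(w, y_in)]

w = [0.0, 0.0, 0.0]
y_in = [1.0, 0.5, -0.5]   # y_0 = +1 carries the bias
d = 1.0
for _ in range(2000):
    w = delta_rule_update(w, y_in, d, eta=0.5)

v = sum(wi * yi for wi, yi in zip(w, y_in))
assert abs(d - phi(v)) < 0.05   # output has moved close to the desired response
```

Repeated application drives the neuron's output toward d_j(n) for this sample; the update needs only quantities local to the neuron, which is why hidden neurons require the separate derivation that follows.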
This is the product of the learning rate η, the local gradient of the associated neuron, δ_j(n), and the input to the neuron, y_i(n). See Figure (21).
Using the chain rule, similarly to how we derived the update for the weights of output neurons, we can show that the weight update for hidden neurons is given as

ω_{ji}^{new}(n) = ω_{ji}^{old}(n) + ∆ω_{ji}(n)

where neuron j is hidden; ϕ′_j(v_j(n)) is the derivative of the associated activation function; the δ_k(n) are associated with the neurons k that are to the immediate right of neuron j and connected to it; and the ω_{kj}(n) are the associated weights of these connections (see Figure (22)).
Figure 22: Signal flow showing hidden neuron j connected to an output neuron k to its immediate right; diagram used to show the derivation of the weight update for a hidden neuron (Haykin 2009)
Recall from Equation (26) and Equation (32):

∆ω_{ji}(n) = η δ_j(n) y_i(n)

For a hidden neuron j,

∂E(n)/∂y_j(n) = −Σ_k e_k(n) ϕ′_k(v_k(n)) ω_{kj}(n) = −Σ_k δ_k(n) ω_{kj}(n)    (45)

so that

δ_j(n) = ϕ′_j(v_j(n)) Σ_k δ_k(n) ω_{kj}(n)    (46)

and when combined with Equation (32) we can write the correction as

∆ω_{ji}(n) = η ϕ′_j(v_j(n)) [Σ_k δ_k(n) ω_{kj}(n)] y_i(n)

In the forward pass the induced local fields and outputs are computed as

v_j(n) = Σ_{i=0}^{m} ω_{ji}(n) y_i(n);  y_j(n) = ϕ_j(v_j(n))    (49)
In the backward pass the error is propagated backward through the network to compute the weight updates (see Figure (22) and Equation (45)):

ω_{ji}^{new}(n) = ω_{ji}^{old}(n) + η e_j(n) ϕ′_j(v_j(n)) y_i(n)    for output neurons
ω_{ji}^{new}(n) = ω_{ji}^{old}(n) + η ϕ′_j(v_j(n)) [Σ_k δ_k(n) ω_{kj}(n)] y_i(n)    for hidden neurons    (50)
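Equations (45)-(46) and (49)-(50) can be exercised on a small one-hidden-layer network. The sketch below uses logistic activations and made-up weights (both assumptions for illustration); it computes the back-propagated gradients and checks the hidden-layer ones against central differences:

```python
import math

def phi(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, W1, W2):
    """Forward phase, Eq. (49); x_0 = y_0 = +1 carry the biases."""
    x = [1.0] + x
    v1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    y1 = [1.0] + [phi(v) for v in v1]
    v2 = [sum(w * yj for w, yj in zip(row, y1)) for row in W2]
    y2 = [phi(v) for v in v2]
    return x, v1, y1, v2, y2

def backprop_grads(x, d, W1, W2):
    """Backward phase: dE/dw for output and hidden weights, Eqs. (45)-(46), (50)."""
    x, v1, y1, v2, y2 = forward(x, W1, W2)
    # output neurons: delta_k = e_k * phi'_k(v_k)
    delta2 = [(dk - yk) * yk * (1 - yk) for dk, yk in zip(d, y2)]
    # hidden neurons, Eq. (46): delta_j = phi'_j(v_j) * sum_k delta_k * w_kj
    delta1 = [phi(v) * (1 - phi(v)) *
              sum(d2 * W2[k][j + 1] for k, d2 in enumerate(delta2))
              for j, v in enumerate(v1)]
    g2 = [[-d2 * yj for yj in y1] for d2 in delta2]   # dE/dw_kj = -delta_k * y_j
    g1 = [[-d1 * xi for xi in x] for d1 in delta1]    # dE/dw_ji = -delta_j * x_i
    return g1, g2

def loss(x, d, W1, W2):
    *_, y2 = forward(x, W1, W2)
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y2))

# check the hidden-layer gradient against central differences
W1 = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.3]]   # 2 hidden neurons, bias in column 0
W2 = [[-0.2, 0.5, 0.1]]                     # 1 output neuron
x, d = [0.7, -0.4], [1.0]
g1, g2 = backprop_grads(x, d, W1, W2)
h = 1e-6
worst = 0.0
for j in range(2):
    for i in range(3):
        Wp = [row[:] for row in W1]; Wp[j][i] += h
        Wm = [row[:] for row in W1]; Wm[j][i] -= h
        num = (loss(x, d, Wp, W2) - loss(x, d, Wm, W2)) / (2 * h)
        worst = max(worst, abs(num - g1[j][i]))
assert worst < 1e-8
```

A training step would then apply Equation (50): subtract η times these gradients (equivalently, add η δ y as in the delta rule) layer by layer.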
Recall from Eqn. 20 that the gradient descent update rule is simply (we
have used notation consistent with our development of backpropagation):
Classical momentum
1: g_t ← ∇f(ω_{t−1})
2: m_t ← μ m_{t−1} + g_t   {momentum term accumulated}
3: ω_t ← ω_{t−1} − η m_t

Advantages:
Accelerates GD learning along dimensions where the gradient remains relatively consistent across training steps
Slows GD learning along turbulent dimensions where the gradient is oscillating wildly
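The three momentum steps above translate directly to code. This is a minimal sketch applied to a hypothetical badly conditioned quadratic; the test function and hyperparameter values are illustrative:

```python
def momentum_step(w, m, grad, eta, mu):
    """Classical momentum: m_t = mu*m_{t-1} + g_t; w_t = w_{t-1} - eta*m_t."""
    m = [mu * mi + gi for mi, gi in zip(m, grad)]
    w = [wi - eta * mi for wi, mi in zip(w, m)]
    return w, m

def grad(w):
    # gradient of f(w) = 0.5*w1^2 + 50*w2^2, a badly conditioned quadratic
    return [w[0], 100.0 * w[1]]

w, m = [5.0, 1.0], [0.0, 0.0]
for _ in range(500):
    w, m = momentum_step(w, m, grad(w), eta=0.01, mu=0.9)

assert abs(w[0]) < 1e-2 and abs(w[1]) < 1e-2
```

Along the shallow w_1 direction the accumulated m_t effectively enlarges the step, while the oscillating w_2 gradients partially cancel inside m_t, which is the damping effect described above.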
AdaGrad
1: g_t ← ∇f(ω_{t−1})
2: n_t ← n_{t−1} + g_t²
3: ω_t ← ω_{t−1} − η g_t/(√n_t + ε)
RMSProp
1: g_t ← ∇f(ω_{t−1})
2: n_t ← ν n_{t−1} + (1 − ν) g_t²
3: ω_t ← ω_{t−1} − η g_t/(√n_t + ε)

Adam
1: g_t ← ∇f(ω_{t−1})
2: m_t ← μ m_{t−1} + (1 − μ) g_t
3: m̂_t ← m_t/(1 − μ^t)
4: n_t ← ν n_{t−1} + (1 − ν) g_t²
5: n̂_t ← n_t/(1 − ν^t)
6: ω_t ← ω_{t−1} − η m̂_t/(√n̂_t + ε)
Rather than the ℓ2 norm, the ℓ∞ norm could be used as well; replace n_t in step 4 and ω_t in step 6 and remove n̂_t in step 5; the new updates (this is the AdaMax algorithm) are:

n_t ← max(ν n_{t−1}, |g_t|)
ω_t ← ω_{t−1} − η m̂_t/(n_t + ε)
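The adaptive updates above map directly to code; this sketch runs the Adam steps in the μ, ν notation used here, on an illustrative quadratic objective (the objective and hyperparameter values are assumptions for the example):

```python
import math

def adam_step(w, m, n, grad, t, eta=0.01, mu=0.9, nu=0.999, eps=1e-8):
    """One Adam step (steps 1-6 above); t starts at 1 for the bias correction."""
    m = [mu * mi + (1 - mu) * gi for mi, gi in zip(m, grad)]       # step 2
    n = [nu * ni + (1 - nu) * gi * gi for ni, gi in zip(n, grad)]  # step 4
    m_hat = [mi / (1 - mu ** t) for mi in m]                       # step 3
    n_hat = [ni / (1 - nu ** t) for ni in n]                       # step 5
    w = [wi - eta * mh / (math.sqrt(nh) + eps)                     # step 6
         for wi, mh, nh in zip(w, m_hat, n_hat)]
    return w, m, n

def f(w):                 # illustrative objective: f(w) = w1^2 + 10*w2^2
    return w[0] ** 2 + 10 * w[1] ** 2

def grad(w):
    return [2 * w[0], 20 * w[1]]

# At t = 1 the bias correction gives m_hat = g_1 and n_hat = g_1^2, so the
# first step is approximately -eta * sign(g) per coordinate.
w, m, n = adam_step([1.0, -1.0], [0.0, 0.0], [0.0, 0.0], [2.0, -20.0], t=1)
assert abs(w[0] - 0.99) < 1e-6 and abs(w[1] + 0.99) < 1e-6

# Iterating on the quadratic drives the loss down substantially.
w, m, n = [1.0, -1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 3001):
    w, m, n = adam_step(w, m, n, grad(w), t)
assert f(w) < 0.5
```

Note the per-coordinate step size: because g_t enters both m_t and √n̂_t, the update magnitude is roughly η regardless of the gradient scale, which is the same self-normalizing behavior AdaGrad and RMSProp aim for.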
SCIT-AMRL (University of Wollongong) Machine Learning ANNDL 53 / 54
Bibliography
Alpaydin, E. (2010), Introduction to Machine Learning, second edn, The MIT Press, Cambridge
Massachusetts.
Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B. & Bharath, A. A. (2018),
‘Generative adversarial networks: An overview’, IEEE Signal Processing Magazine 35(1), 53 –
65.
Dozat, T. (2015), Incorporating Nesterov momentum in Adam, Technical report, Linguistics
Department, Stanford University.
Duda, R. O., Hart, P. E. & Stork, D. G. (2001), Pattern Classification, Second edn, John Wiley and
Sons.
Géron, A. (2017), Hands-on Machine Learning with Scikit-Learn and TensorFlow, O’Reilly Media,
Inc., CA, USA.
Goodfellow, I., Bengio, Y. & Courville, A. (2016), Deep Learning, MIT Press.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. &
Bengio, Y. (2014), Generative adversarial nets, in ‘Proc. Advances Neural Information
Processing Systems Conf’, Montreal, Quebec, Canada, p. 2672–2680.
Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning - Data
Mining, Inference and Prediction, Springer Science+Business Media LLC.
Haykin, S. (2009), Neural Networks and Learning Machines, Third edn, Pearson Education.
He, K., Zhang, X., Ren, S. & Sun, J. (2015), Deep residual learning for image recognition,
Technical Report arXiv:1512.03385 [cs.CV], arXiv.
URL: https://arxiv.org/pdf/1512.03385.pdf
Kingma, D. P. & Ba, J. L. (2015), Adam: A method for stochastic optimization, in 'International
Conference on Learning Representations', San Diego, CA, USA, pp. 1-15.
Lemarechal, C. (2012), ‘Cauchy and the gradient method’, Documenta Mathematica Extra
Volume ISMP, 251–254.