
Lecture 7. Multilayer Perceptron. Backpropagation
COMP90051 Statistical Machine Learning

Semester 2, 2017
Lecturer: Andrey Kan

Copyright: University of Melbourne



This lecture
• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries

• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation


Animals in the zoo


[Diagram: a taxonomy of Artificial Neural Networks (ANNs). ANNs split into recurrent neural networks and feed-forward networks; feed-forward networks include perceptrons, multilayer perceptrons and convolutional neural networks. Art: OpenClipartVectors at pixabay.com (CC0)]

• Recurrent neural networks are not covered in this subject


• If time permits, we will cover autoencoders. An autoencoder is an ANN
trained in a specific way.
∗ E.g., a multilayer perceptron can be trained as an autoencoder, or a recurrent
neural network can be trained as an autoencoder.

Multilayer Perceptron

Modelling non-linearity via function composition


Limitations of linear models


Some functions are linearly separable, but many are not

[Plots: the AND and OR functions are linearly separable; XOR is not]

Possible solution: composition

$x_1\ \mathrm{XOR}\ x_2 = (x_1\ \mathrm{OR}\ x_2)\ \mathrm{AND}\ \mathrm{not}\,(x_1\ \mathrm{AND}\ x_2)$

We are going to combine perceptrons …
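To make this composition concrete, here is a minimal sketch (my own illustration, not from the slides) that builds XOR out of three step-activation perceptrons; the specific weights and biases are hand-picked assumptions that implement OR, AND, and the combining gate.

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron with a step activation: 1 if w.x + b >= 0, else 0."""
    return int(np.dot(w, x) + b >= 0)

def xor(x1, x2):
    x = np.array([x1, x2])
    h_or  = perceptron(x, np.array([1.0, 1.0]), -0.5)   # x1 OR x2
    h_and = perceptron(x, np.array([1.0, 1.0]), -1.5)   # x1 AND x2
    # (x1 OR x2) AND not (x1 AND x2)
    return perceptron(np.array([h_or, h_and]), np.array([1.0, -1.0]), -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # prints the XOR truth table
```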


Simplified graphical representation

[Image: Sputnik – the first artificial Earth satellite]


The perceptron is a sort of building block for ANNs


• ANNs are not restricted to binary classification
• Nodes in an ANN can have various activation functions

Step function: $f(s) = \begin{cases} 1, & \text{if } s \ge 0 \\ 0, & \text{if } s < 0 \end{cases}$

Sign function: $f(s) = \begin{cases} 1, & \text{if } s \ge 0 \\ -1, & \text{if } s < 0 \end{cases}$

Logistic function: $f(s) = \dfrac{1}{1 + e^{-s}}$

Many others: $\tanh$, rectifier, etc.
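As a quick illustration (my addition, not part of the slides), these activations can be written directly as NumPy functions:

```python
import numpy as np

def step(s):
    return np.where(s >= 0, 1.0, 0.0)

def sign(s):
    return np.where(s >= 0, 1.0, -1.0)

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):            # the "rectifier"
    return np.maximum(0.0, s)

s = np.array([-2.0, 0.0, 2.0])
print(step(s), sign(s), logistic(s), np.tanh(s), relu(s))
```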


Feed-forward Artificial Neural Network


[Diagram: a feed-forward network. Inputs $x_1, \dots, x_m$ feed via weights $v_{ij}$ into hidden units $u_1, \dots, u_p$, which feed via weights $w_{jk}$ into outputs $z_1, \dots, z_q$; computation flows from the input layer through the hidden layer to the output layer. ANNs can have more than one hidden layer.]

∗ $x_i$ are inputs, i.e., attributes. Note: here the $x_i$ are components of a single training instance $\boldsymbol{x}$; a training dataset is a set of such instances
∗ $z_k$ are outputs, i.e., predicted labels. Note: ANNs naturally handle multidimensional output; e.g., for handwritten digit recognition, each output node can represent the probability of a digit

ANN as a function composition

$$r_j = v_{0j} + \sum_{i=1}^{m} x_i v_{ij} \qquad u_j = g(r_j) \qquad s_k = w_{0k} + \sum_{j=1}^{p} u_j w_{jk} \qquad z_k = h(s_k)$$

∗ Note that $z_k$ is a function composition (a function applied to the result of another function, etc.)
∗ Here $g, h$ are activation functions. These can be either the same (e.g., both sigmoid) or different
∗ You can add a bias node $x_0 = 1$ to simplify the equations: $r_j = \sum_{i=0}^{m} x_i v_{ij}$. Similarly, you can add a bias node $u_0 = 1$ to simplify the equations: $s_k = \sum_{j=0}^{p} u_j w_{jk}$

[Diagram: the same network as on the previous slide, annotated with these computations at each node]
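A minimal sketch of this forward computation (my own illustration, not from the deck), using sigmoid activations for both $g$ and $h$; the bias trick is implemented by prepending a constant 1, and the weight shapes are assumptions:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, V, W):
    """Forward pass of a one-hidden-layer network.
    x: inputs, shape (m,)
    V: hidden-layer weights, shape (m+1, p) -- row 0 holds the biases v_0j
    W: output-layer weights, shape (p+1, q) -- row 0 holds the biases w_0k
    """
    x = np.concatenate(([1.0], x))      # bias node x_0 = 1
    r = x @ V                           # r_j = sum_i x_i v_ij
    u = sigmoid(r)                      # u_j = g(r_j)
    u = np.concatenate(([1.0], u))      # bias node u_0 = 1
    s = u @ W                           # s_k = sum_j u_j w_jk
    return sigmoid(s)                   # z_k = h(s_k)

rng = np.random.default_rng(0)
m, p, q = 4, 3, 2
V = rng.normal(scale=0.1, size=(m + 1, p))
W = rng.normal(scale=0.1, size=(p + 1, q))
print(forward(rng.normal(size=m), V, W))
```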

ANN in supervised learning


• ANNs can be naturally adapted to various supervised
learning setups, such as univariate and multivariate
regression, as well as binary and multilabel classification
• Univariate regression $y = f(\boldsymbol{x})$
∗ e.g., linear regression earlier in the course

• Multivariate regression $\boldsymbol{y} = f(\boldsymbol{x})$
∗ predicting values for multiple continuous outcomes

• Binary classification
∗ e.g., predict whether a patient has type II diabetes

• Multivariate classification
∗ e.g., handwritten digit recognition with labels "1", "2", etc.

The power of ANN as a non-linear model


• ANNs are capable of approximating various non-linear functions, e.g., $z(x) = x^2$ and $z(x) = \sin x$

• For example, consider the following network. In this example, the hidden unit activation functions are $\tanh$

[Diagram: a network with a single input $x$, three tanh hidden units $u_1, u_2, u_3$, and a single output $z$. Plot by Geek3, Wikimedia Commons (CC3)]
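As a sketch (my own illustration with made-up weights), such a network computes $z(x) = w_0 + \sum_{j=1}^{3} w_j \tanh(v_{0j} + v_{1j}x)$; in practice the weights would be fitted to samples of the target function, producing the kind of approximation shown on the next slide:

```python
import numpy as np

def small_net(x, v, w):
    """z(x) = w_0 + sum_j w_j * tanh(v_0j + v_1j * x), with three tanh hidden units."""
    r = v[0] + v[1] * x          # r_j for each of the 3 hidden units
    u = np.tanh(r)               # hidden activations u_j
    return w[0] + u @ w[1:]      # identity output activation

# Illustrative parameters only -- real values would be learned from data
v = np.array([[-1.0, 0.0, 1.0],     # biases v_0j
              [ 1.0, 1.0, 1.0]])    # input weights v_1j
w = np.array([0.5, -1.0, 0.0, 1.0]) # output bias w_0 and weights w_1, w_2, w_3
print(small_net(0.3, v, w))
```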


The power of ANN as a non-linear model


• ANNs are capable of approximating various non-linear functions, e.g., $z(x) = x^2$ and $z(x) = \sin x$

[Plot, adapted from Bishop 2007: blue points are the function values $z(x)$ evaluated at different $x$; red lines are the predictions from the ANN; dashed lines are the outputs of the hidden units]

• Universal approximation theorem (Cybenko 1989): an ANN with a hidden layer containing a finite number of units, under mild assumptions on the activation function, can approximate continuous functions on compact subsets of $\mathbb{R}^n$ arbitrarily well

How to train your dragon network?

• You know the drill: Define the loss function and find
parameters that minimise the loss on training data

• In the following, we are going to use stochastic gradient descent with a batch size of one. That is, we will process training examples one by one

[Image: adapted from a movie poster, Flickr user jdxyw (CC BY-SA 2.0)]


Training setup: univariate regression

• In what follows, we consider the univariate regression setup

• Moreover, we will use the identity output activation function: $z = h(s) = s = \sum_{j=0}^{p} u_j w_j$

• This will simplify the description of backpropagation. In other settings, the training procedure is similar

[Diagram: inputs $x_1, \dots, x_m$, hidden-layer weights $v_{ij}$, hidden units $u_1, \dots, u_p$, output weights $w_j$, and a single output $z$]

Training setup: univariate regression


• How many parameters does this ANN have? Bias nodes $x_0$ and $u_0$ are present, but not shown
∗ $mp + p + 1$
∗ $(m + 2)p + 1$
∗ $(m + 1)p$

[Diagram: the same network as on the previous slide; art: OpenClipartVectors at pixabay.com (CC0)]
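For reference, a worked count (my addition): the hidden layer has $(m+1)p$ weights $v_{ij}$ once the bias node $x_0$ is included, and the output unit has $p+1$ weights $w_j$ including the bias $u_0$, so in total

$$(m+1)p + (p+1) = mp + 2p + 1 = (m+2)p + 1$$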


Loss function for ANN training


• In online training, we need to define the loss between a single training example $\{\boldsymbol{x}, y\}$ and the ANN's prediction $\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) = z$, where $\boldsymbol{\theta}$ is a parameter vector comprised of all coefficients $v_{ij}$ and $w_j$

• For regression we can use good old squared error
$$L = \frac{1}{2}\left(\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) - y\right)^2 = \frac{1}{2}(z - y)^2$$
(the constant is used for mathematical convenience, see later)

• Training means finding the minimum of $L$ as a function of parameter vector $\boldsymbol{\theta}$
∗ Fortunately $L(\boldsymbol{\theta})$ is a differentiable function
∗ Unfortunately there is no analytic solution in general

Stochastic gradient descent for ANN


Choose initial guess $\boldsymbol{\theta}^{(0)}$, $k = 0$
Here $\boldsymbol{\theta}$ is the set of all weights from all layers

For $i$ from 1 to $T$ (epochs)
    For $j$ from 1 to $N$ (training examples)
        Consider example $\{\boldsymbol{x}_j, y_j\}$
        Update: $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta \nabla L(\boldsymbol{\theta}^{(k)})$, then $k \leftarrow k + 1$

Here $L = \frac{1}{2}(z_j - y_j)^2$, so we need to compute the partial derivatives $\frac{\partial L}{\partial v_{ij}}$ and $\frac{\partial L}{\partial w_j}$
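A minimal sketch of this loop in Python (my own illustration; `gradients` is a hypothetical helper standing in for the backpropagation derivatives derived next, and `theta` is assumed to be a dict of weight arrays):

```python
import numpy as np

def sgd_train(X, y, theta, gradients, eta=0.01, epochs=10):
    """Stochastic gradient descent with a batch size of one.

    X: (N, m) training inputs; y: (N,) targets; theta: dict of weight arrays.
    gradients(x_j, y_j, theta) -> dict with the same keys as theta,
    holding the partial derivatives computed by backpropagation.
    """
    for epoch in range(epochs):                    # "for i from 1 to T"
        for j in np.random.permutation(len(X)):    # "for j from 1 to N"
            grad = gradients(X[j], y[j], theta)    # consider example (x_j, y_j)
            for name in theta:                     # theta <- theta - eta * grad L
                theta[name] -= eta * grad[name]
    return theta
```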

Backpropagation
= “backward propagation of errors”

Calculating the gradient of a loss function


Backpropagation: start with the chain rule


• Recall that the output $z$ of an ANN is a function composition, and hence $L(z)$ is also a composition
∗ $L = 0.5(z - y)^2 = 0.5(h(s) - y)^2 = 0.5(s - y)^2$
∗ $= 0.5\left(\sum_{j=0}^{p} u_j w_j - y\right)^2 = 0.5\left(\sum_{j=0}^{p} g(r_j) w_j - y\right)^2 = \cdots$

• Backpropagation makes use of this fact by applying the chain rule for derivatives

• $\dfrac{\partial L}{\partial w_j} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}\dfrac{\partial s}{\partial w_j}$

• $\dfrac{\partial L}{\partial v_{ij}} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}\dfrac{\partial s}{\partial u_j}\dfrac{\partial u_j}{\partial r_j}\dfrac{\partial r_j}{\partial v_{ij}}$

[Diagram: the same network, with inputs $x_1, \dots, x_m$, weights $v_{ij}$, hidden units $u_1, \dots, u_p$, weights $w_j$, and output $z$]

Backpropagation: intermediate step


• Apply the chain rule
$$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial w_j} \qquad \frac{\partial L}{\partial v_{ij}} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}\frac{\partial r_j}{\partial v_{ij}}$$

• Now define
$$\delta \equiv \frac{\partial L}{\partial s} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s} \qquad \varepsilon_j \equiv \frac{\partial L}{\partial r_j} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}$$

• Here $L = 0.5(z - y)^2$ and $z = s$, thus $\delta = z - y$

• Here $s = \sum_{j=0}^{p} u_j w_j$ and $u_j = g(r_j)$, thus $\varepsilon_j = \delta w_j g'(r_j)$

[Diagram: the same network as before]

Backpropagation equations
• We have
$$\frac{\partial L}{\partial w_j} = \delta \frac{\partial s}{\partial w_j} \qquad \frac{\partial L}{\partial v_{ij}} = \varepsilon_j \frac{\partial r_j}{\partial v_{ij}}$$

• … where
∗ $\delta = \dfrac{\partial L}{\partial s} = z - y$
∗ $\varepsilon_j = \dfrac{\partial L}{\partial r_j} = \delta w_j g'(r_j)$

• Recall that
∗ $s = \sum_{j=0}^{p} u_j w_j$
∗ $r_j = \sum_{i=0}^{m} x_i v_{ij}$

• So $\dfrac{\partial s}{\partial w_j} = u_j$ and $\dfrac{\partial r_j}{\partial v_{ij}} = x_i$

• We have
∗ $\dfrac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$

[Diagram: the same network as before]
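To make the equations concrete, here is a minimal sketch (my own illustration, not from the deck) of one forward pass followed by these gradient computations, using a sigmoid hidden activation $g$ and the identity output activation; the weight layout with explicit bias rows is an assumption:

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def backprop_single(x, y, v, w):
    """Gradients of L = 0.5*(z - y)^2 for a single training example.

    v: (m+1, p) hidden weights, row 0 holding the biases v_0j.
    w: (p+1,)  output weights, w[0] holding the bias w_0.
    """
    x = np.concatenate(([1.0], x))               # bias node x_0 = 1
    r = x @ v                                    # r_j = sum_i x_i v_ij
    g_r = sigmoid(r)                             # u_j = g(r_j)
    u = np.concatenate(([1.0], g_r))             # bias node u_0 = 1
    z = u @ w                                    # identity output: z = s
    delta = z - y                                # delta = dL/ds = z - y
    grad_w = delta * u                           # dL/dw_j = delta * u_j
    eps = delta * w[1:] * g_r * (1.0 - g_r)      # eps_j = delta * w_j * g'(r_j)
    grad_v = np.outer(x, eps)                    # dL/dv_ij = eps_j * x_i
    return z, grad_v, grad_w

rng = np.random.default_rng(0)
m, p = 3, 4
z, gv, gw = backprop_single(rng.normal(size=m), 0.5,
                            rng.normal(size=(m + 1, p)), rng.normal(size=p + 1))
print(z, gv.shape, gw.shape)
```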

Forward propagation
• Use the current estimates of $v_{ij}$ and $w_j$ to calculate $r_j$, $u_j$, $s$ and $z$

• Backpropagation equations
∗ $\dfrac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$

[Diagram: the same network; the forward pass flows from the inputs $x_i$ to the output $z$]

Backward propagation of errors

$$\delta = z - y \qquad \varepsilon_j = \delta w_j g'(r_j) \qquad \frac{\partial L}{\partial w_j} = \delta u_j \qquad \frac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i$$

• Backpropagation equations
∗ $\dfrac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$

[Diagram: the same network; the error $\delta$ is passed backwards from the output $z$ through the hidden units towards the inputs]
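As a sanity check (my addition, not in the slides), the analytic gradient $\partial L/\partial w_j = (z - y)u_j$ derived above can be compared against a numerical finite difference; here the hidden activation $g$ is $\tanh$ and all names are illustrative:

```python
import numpy as np

def loss(x, y, v, w):
    """L = 0.5*(z - y)^2 for a one-hidden-layer network with tanh hidden units
    and identity output; v, w use the same bias-row layout as the sketch above."""
    xb = np.concatenate(([1.0], x))
    u = np.concatenate(([1.0], np.tanh(xb @ v)))
    return 0.5 * (u @ w - y) ** 2

def numeric_dL_dw(x, y, v, w, j, h=1e-6):
    """Central finite-difference approximation of dL/dw_j."""
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += h
    w_minus[j] -= h
    return (loss(x, y, v, w_plus) - loss(x, y, v, w_minus)) / (2 * h)

rng = np.random.default_rng(1)
m, p = 3, 4
x, y = rng.normal(size=m), 0.7
v, w = rng.normal(size=(m + 1, p)), rng.normal(size=p + 1)

xb = np.concatenate(([1.0], x))
u = np.concatenate(([1.0], np.tanh(xb @ v)))
z = u @ w
print((z - y) * u[1])                 # analytic dL/dw_1 = (z - y) * u_1
print(numeric_dL_dw(x, y, v, w, 1))   # should agree closely
```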

Some further notes on ANN training


• ANN is a flexible model (recall the universal approximation theorem), but the flip side of this flexibility is over-parameterisation, and hence a tendency to overfitting
• Starting weights are usually small random values
distributed around zero
• Implicit regularisation: early stopping
∗ With some activation functions, this shrinks the ANN towards a linear model (why?)

[Plot by Geek3, Wikimedia Commons (CC3)]


Explicit regularisation
• Alternatively, explicit regularisation can be used, much like in ridge regression

• Instead of minimising the loss $L$, minimise the regularised function $L + \lambda\left(\sum_{i=0}^{m}\sum_{j=1}^{p} v_{ij}^2 + \sum_{j=0}^{p} w_j^2\right)$

• This will simply add $2\lambda v_{ij}$ and $2\lambda w_j$ terms to the partial derivatives
• With some activation functions this also shrinks the ANN
towards a linear model
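In code, this is a one-line change to the gradients (my illustration; `grad_v` and `grad_w` are assumed to be the unregularised backpropagation gradients in the layout used in the earlier sketches):

```python
import numpy as np

def regularised_gradients(grad_v, grad_w, v, w, lam):
    """Add the ridge-style penalty gradients 2*lam*v and 2*lam*w
    to the unregularised backpropagation gradients."""
    return grad_v + 2.0 * lam * v, grad_w + 2.0 * lam * w

# Tiny usage example with made-up arrays
v, w = np.ones((3, 2)), np.ones(3)
gv, gw = np.zeros_like(v), np.zeros_like(w)
print(regularised_gradients(gv, gw, v, w, lam=0.1))
```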


This lecture
• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries

• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation

