
Lecture 7. Multilayer Perceptron. Backpropagation
COMP90051 Statistical Machine Learning

Semester 2, 2017
Lecturer: Andrey Kan

Copyright: University of Melbourne



This lecture
• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries

• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation


Animals in the zoo


[Diagram: a taxonomy of Artificial Neural Networks (ANNs). ANNs split into recurrent neural networks and feed-forward networks; feed-forward networks include perceptrons, multilayer perceptrons and convolutional neural networks. Art: OpenClipartVectors at pixabay.com (CC0)]

• Recurrent neural networks are not covered in this subject


• If time permits, we will cover autoencoders. An autoencoder is an ANN
trained in a specific way.
∗ E.g., a multilayer perceptron can be trained as an autoencoder, or a recurrent
neural network can be trained as an autoencoder.

Multilayer Perceptron

Modelling non-linearity via function composition


Limitations of linear models


Some functions are linearly separable, but many are not

[Plots: the AND and OR functions are linearly separable; XOR is not]

Possible solution: composition

$x_1\ \mathrm{XOR}\ x_2 = (x_1\ \mathrm{OR}\ x_2)\ \mathrm{AND}\ \mathrm{not}\,(x_1\ \mathrm{AND}\ x_2)$

We are going to combine perceptrons …
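To make this composition concrete, here is a minimal sketch (my own illustration, not from the slides) that builds XOR out of three step-activation perceptrons; the specific weights and biases are hand-picked assumptions that implement OR, AND, and the combining gate.

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron with a step activation: 1 if w.x + b >= 0, else 0."""
    return int(np.dot(w, x) + b >= 0)

def xor(x1, x2):
    x = np.array([x1, x2])
    h_or  = perceptron(x, np.array([1.0, 1.0]), -0.5)   # x1 OR x2
    h_and = perceptron(x, np.array([1.0, 1.0]), -1.5)   # x1 AND x2
    # (x1 OR x2) AND not (x1 AND x2)
    return perceptron(np.array([h_or, h_and]), np.array([1.0, -1.0]), -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # prints the XOR truth table
```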


Simplified graphical representation

[Image: Sputnik – the first artificial Earth satellite]


The perceptron is a sort of building block for ANNs


• ANNs are not restricted to binary classification
• Nodes in an ANN can have various activation functions

Step function: $f(s) = \begin{cases} 1, & \text{if } s \ge 0 \\ 0, & \text{if } s < 0 \end{cases}$

Sign function: $f(s) = \begin{cases} 1, & \text{if } s \ge 0 \\ -1, & \text{if } s < 0 \end{cases}$

Logistic function: $f(s) = \dfrac{1}{1 + e^{-s}}$

Many others: $\tanh$, rectifier, etc.
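As a quick illustration (my addition, not part of the slides), these activations can be written directly as NumPy functions:

```python
import numpy as np

def step(s):
    return np.where(s >= 0, 1.0, 0.0)

def sign(s):
    return np.where(s >= 0, 1.0, -1.0)

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):            # the "rectifier"
    return np.maximum(0.0, s)

s = np.array([-2.0, 0.0, 2.0])
print(step(s), sign(s), logistic(s), np.tanh(s), relu(s))
```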


Feed-forward Artificial Neural Network


[Diagram: a feed-forward network. Inputs $x_1, \dots, x_m$ feed via weights $v_{ij}$ into hidden units $u_1, \dots, u_p$, which feed via weights $w_{jk}$ into outputs $z_1, \dots, z_q$; computation flows from the input layer through the hidden layer to the output layer. ANNs can have more than one hidden layer.]

∗ $x_i$ are inputs, i.e., attributes. Note: here the $x_i$ are components of a single training instance $\boldsymbol{x}$; a training dataset is a set of such instances
∗ $z_k$ are outputs, i.e., predicted labels. Note: ANNs naturally handle multidimensional output; e.g., for handwritten digit recognition, each output node can represent the probability of a digit

ANN as a function composition

$$r_j = v_{0j} + \sum_{i=1}^{m} x_i v_{ij} \qquad u_j = g(r_j) \qquad s_k = w_{0k} + \sum_{j=1}^{p} u_j w_{jk} \qquad z_k = h(s_k)$$

∗ Note that $z_k$ is a function composition (a function applied to the result of another function, etc.)
∗ Here $g, h$ are activation functions. These can be either the same (e.g., both sigmoid) or different
∗ You can add a bias node $x_0 = 1$ to simplify the equations: $r_j = \sum_{i=0}^{m} x_i v_{ij}$. Similarly, you can add a bias node $u_0 = 1$ to simplify the equations: $s_k = \sum_{j=0}^{p} u_j w_{jk}$

[Diagram: the same network as on the previous slide, annotated with these computations at each node]
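A minimal sketch of this forward computation (my own illustration, not from the deck), using sigmoid activations for both $g$ and $h$; the bias trick is implemented by prepending a constant 1, and the weight shapes are assumptions:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, V, W):
    """Forward pass of a one-hidden-layer network.
    x: inputs, shape (m,)
    V: hidden-layer weights, shape (m+1, p) -- row 0 holds the biases v_0j
    W: output-layer weights, shape (p+1, q) -- row 0 holds the biases w_0k
    """
    x = np.concatenate(([1.0], x))      # bias node x_0 = 1
    r = x @ V                           # r_j = sum_i x_i v_ij
    u = sigmoid(r)                      # u_j = g(r_j)
    u = np.concatenate(([1.0], u))      # bias node u_0 = 1
    s = u @ W                           # s_k = sum_j u_j w_jk
    return sigmoid(s)                   # z_k = h(s_k)

rng = np.random.default_rng(0)
m, p, q = 4, 3, 2
V = rng.normal(scale=0.1, size=(m + 1, p))
W = rng.normal(scale=0.1, size=(p + 1, q))
print(forward(rng.normal(size=m), V, W))
```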

ANN in supervised learning


• ANNs can be naturally adapted to various supervised
learning setups, such as univariate and multivariate
regression, as well as binary and multilabel classification
• Univariate regression $y = f(\boldsymbol{x})$
∗ e.g., linear regression earlier in the course

• Multivariate regression $\boldsymbol{y} = f(\boldsymbol{x})$
∗ predicting values for multiple continuous outcomes

• Binary classification
∗ e.g., predict whether a patient has type II diabetes

• Multivariate classification
∗ e.g., handwritten digit recognition with labels "1", "2", etc.

The power of ANN as a non-linear model


• ANNs are capable of approximating various non-linear functions, e.g., $z(x) = x^2$ and $z(x) = \sin x$

• For example, consider the following network. In this example, the hidden unit activation functions are $\tanh$

[Diagram: a network with a single input $x$, three tanh hidden units $u_1, u_2, u_3$, and a single output $z$. Plot by Geek3, Wikimedia Commons (CC3)]
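As a sketch (my own illustration with made-up weights), such a network computes $z(x) = w_0 + \sum_{j=1}^{3} w_j \tanh(v_{0j} + v_{1j}x)$; in practice the weights would be fitted to samples of the target function, producing the kind of approximation shown on the next slide:

```python
import numpy as np

def small_net(x, v, w):
    """z(x) = w_0 + sum_j w_j * tanh(v_0j + v_1j * x), with three tanh hidden units."""
    r = v[0] + v[1] * x          # r_j for each of the 3 hidden units
    u = np.tanh(r)               # hidden activations u_j
    return w[0] + u @ w[1:]      # identity output activation

# Illustrative parameters only -- real values would be learned from data
v = np.array([[-1.0, 0.0, 1.0],     # biases v_0j
              [ 1.0, 1.0, 1.0]])    # input weights v_1j
w = np.array([0.5, -1.0, 0.0, 1.0]) # output bias w_0 and weights w_1, w_2, w_3
print(small_net(0.3, v, w))
```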


The power of ANN as a non-linear model


• ANNs are capable of approximating various non-linear functions, e.g., $z(x) = x^2$ and $z(x) = \sin x$

[Plot, adapted from Bishop 2007: blue points are the function values $z(x)$ evaluated at different $x$; red lines are the predictions from the ANN; dashed lines are the outputs of the hidden units]

• Universal approximation theorem (Cybenko 1989): an ANN with a hidden layer containing a finite number of units, under mild assumptions on the activation function, can approximate continuous functions on compact subsets of $\mathbb{R}^n$ arbitrarily well

How to train your dragon network?

• You know the drill: Define the loss function and find
parameters that minimise the loss on training data

• In the following, we are going to use stochastic gradient descent with a batch size of one. That is, we will process training examples one by one

[Image: adapted from a movie poster, Flickr user jdxyw (CC BY-SA 2.0)]


Training setup: univariate regression

• In what follows, we consider the univariate regression setup

• Moreover, we will use the identity output activation function: $z = h(s) = s = \sum_{j=0}^{p} u_j w_j$

• This will simplify the description of backpropagation. In other settings, the training procedure is similar

[Diagram: inputs $x_1, \dots, x_m$, hidden-layer weights $v_{ij}$, hidden units $u_1, \dots, u_p$, output weights $w_j$, and a single output $z$]

Training setup: univariate regression


• How many parameters does this ANN have? Bias nodes $x_0$ and $u_0$ are present, but not shown
∗ $mp + p + 1$
∗ $(m + 2)p + 1$
∗ $(m + 1)p$

[Diagram: the same network as on the previous slide; art: OpenClipartVectors at pixabay.com (CC0)]
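For reference, a worked count (my addition): the hidden layer has $(m+1)p$ weights $v_{ij}$ once the bias node $x_0$ is included, and the output unit has $p+1$ weights $w_j$ including the bias $u_0$, so in total

$$(m+1)p + (p+1) = mp + 2p + 1 = (m+2)p + 1$$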


Loss function for ANN training


• In online training, we need to define the loss between a single training example $\{\boldsymbol{x}, y\}$ and the ANN's prediction $\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) = z$, where $\boldsymbol{\theta}$ is a parameter vector comprised of all coefficients $v_{ij}$ and $w_j$

• For regression we can use good old squared error
$$L = \frac{1}{2}\left(\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) - y\right)^2 = \frac{1}{2}(z - y)^2$$
(the constant is used for mathematical convenience, see later)

• Training means finding the minimum of $L$ as a function of parameter vector $\boldsymbol{\theta}$
∗ Fortunately $L(\boldsymbol{\theta})$ is a differentiable function
∗ Unfortunately there is no analytic solution in general

Stochastic gradient descent for ANN


Choose initial guess $\boldsymbol{\theta}^{(0)}$, $k = 0$
Here $\boldsymbol{\theta}$ is the set of all weights from all layers

For $i$ from 1 to $T$ (epochs)
    For $j$ from 1 to $N$ (training examples)
        Consider example $\{\boldsymbol{x}_j, y_j\}$
        Update: $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta \nabla L(\boldsymbol{\theta}^{(k)})$, then $k \leftarrow k + 1$

Here $L = \frac{1}{2}(z_j - y_j)^2$, so we need to compute the partial derivatives $\frac{\partial L}{\partial v_{ij}}$ and $\frac{\partial L}{\partial w_j}$
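A minimal sketch of this loop in Python (my own illustration; `gradients` is a hypothetical helper standing in for the backpropagation derivatives derived next, and `theta` is assumed to be a dict of weight arrays):

```python
import numpy as np

def sgd_train(X, y, theta, gradients, eta=0.01, epochs=10):
    """Stochastic gradient descent with a batch size of one.

    X: (N, m) training inputs; y: (N,) targets; theta: dict of weight arrays.
    gradients(x_j, y_j, theta) -> dict with the same keys as theta,
    holding the partial derivatives computed by backpropagation.
    """
    for epoch in range(epochs):                    # "for i from 1 to T"
        for j in np.random.permutation(len(X)):    # "for j from 1 to N"
            grad = gradients(X[j], y[j], theta)    # consider example (x_j, y_j)
            for name in theta:                     # theta <- theta - eta * grad L
                theta[name] -= eta * grad[name]
    return theta
```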

Backpropagation
= “backward propagation of errors”

Calculating the gradient of a loss function


Backpropagation: start with the chain rule


• Recall that the output $z$ of an ANN is a function composition, and hence $L(z)$ is also a composition
∗ $L = 0.5(z - y)^2 = 0.5(h(s) - y)^2 = 0.5(s - y)^2$
∗ $= 0.5\left(\sum_{j=0}^{p} u_j w_j - y\right)^2 = 0.5\left(\sum_{j=0}^{p} g(r_j) w_j - y\right)^2 = \cdots$

• Backpropagation makes use of this fact by applying the chain rule for derivatives

• $\dfrac{\partial L}{\partial w_j} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}\dfrac{\partial s}{\partial w_j}$

• $\dfrac{\partial L}{\partial v_{ij}} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}\dfrac{\partial s}{\partial u_j}\dfrac{\partial u_j}{\partial r_j}\dfrac{\partial r_j}{\partial v_{ij}}$

[Diagram: the same network, with inputs $x_1, \dots, x_m$, weights $v_{ij}$, hidden units $u_1, \dots, u_p$, weights $w_j$, and output $z$]

Backpropagation: intermediate step


• Apply the chain rule
$$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial w_j} \qquad \frac{\partial L}{\partial v_{ij}} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}\frac{\partial r_j}{\partial v_{ij}}$$

• Now define
$$\delta \equiv \frac{\partial L}{\partial s} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s} \qquad \varepsilon_j \equiv \frac{\partial L}{\partial r_j} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}$$

• Here $L = 0.5(z - y)^2$ and $z = s$, thus $\delta = z - y$

• Here $s = \sum_{j=0}^{p} u_j w_j$ and $u_j = g(r_j)$, thus $\varepsilon_j = \delta w_j g'(r_j)$

[Diagram: the same network as before]

Backpropagation equations
• We have
$$\frac{\partial L}{\partial w_j} = \delta \frac{\partial s}{\partial w_j} \qquad \frac{\partial L}{\partial v_{ij}} = \varepsilon_j \frac{\partial r_j}{\partial v_{ij}}$$

• … where
∗ $\delta = \dfrac{\partial L}{\partial s} = z - y$
∗ $\varepsilon_j = \dfrac{\partial L}{\partial r_j} = \delta w_j g'(r_j)$

• Recall that
∗ $s = \sum_{j=0}^{p} u_j w_j$
∗ $r_j = \sum_{i=0}^{m} x_i v_{ij}$

• So $\dfrac{\partial s}{\partial w_j} = u_j$ and $\dfrac{\partial r_j}{\partial v_{ij}} = x_i$

• We have
∗ $\dfrac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$

[Diagram: the same network as before]
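To make the equations concrete, here is a minimal sketch (my own illustration, not from the deck) of one forward pass followed by these gradient computations, using a sigmoid hidden activation $g$ and the identity output activation; the weight layout with explicit bias rows is an assumption:

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def backprop_single(x, y, v, w):
    """Gradients of L = 0.5*(z - y)^2 for a single training example.

    v: (m+1, p) hidden weights, row 0 holding the biases v_0j.
    w: (p+1,)  output weights, w[0] holding the bias w_0.
    """
    x = np.concatenate(([1.0], x))               # bias node x_0 = 1
    r = x @ v                                    # r_j = sum_i x_i v_ij
    g_r = sigmoid(r)                             # u_j = g(r_j)
    u = np.concatenate(([1.0], g_r))             # bias node u_0 = 1
    z = u @ w                                    # identity output: z = s
    delta = z - y                                # delta = dL/ds = z - y
    grad_w = delta * u                           # dL/dw_j = delta * u_j
    eps = delta * w[1:] * g_r * (1.0 - g_r)      # eps_j = delta * w_j * g'(r_j)
    grad_v = np.outer(x, eps)                    # dL/dv_ij = eps_j * x_i
    return z, grad_v, grad_w

rng = np.random.default_rng(0)
m, p = 3, 4
z, gv, gw = backprop_single(rng.normal(size=m), 0.5,
                            rng.normal(size=(m + 1, p)), rng.normal(size=p + 1))
print(z, gv.shape, gw.shape)
```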

Forward propagation
• Use the current estimates of $v_{ij}$ and $w_j$ to calculate $r_j$, $u_j$, $s$ and $z$

• Backpropagation equations
∗ $\dfrac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$

[Diagram: the same network; the forward pass flows from the inputs $x_i$ to the output $z$]

Backward propagation of errors

$$\delta = z - y \qquad \varepsilon_j = \delta w_j g'(r_j) \qquad \frac{\partial L}{\partial w_j} = \delta u_j \qquad \frac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i$$

• Backpropagation equations
∗ $\dfrac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$

[Diagram: the same network; the error $\delta$ is passed backwards from the output $z$ through the hidden units towards the inputs]
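As a sanity check (my addition, not in the slides), the analytic gradient $\partial L/\partial w_j = (z - y)u_j$ derived above can be compared against a numerical finite difference; here the hidden activation $g$ is $\tanh$ and all names are illustrative:

```python
import numpy as np

def loss(x, y, v, w):
    """L = 0.5*(z - y)^2 for a one-hidden-layer network with tanh hidden units
    and identity output; v, w use the same bias-row layout as the sketch above."""
    xb = np.concatenate(([1.0], x))
    u = np.concatenate(([1.0], np.tanh(xb @ v)))
    return 0.5 * (u @ w - y) ** 2

def numeric_dL_dw(x, y, v, w, j, h=1e-6):
    """Central finite-difference approximation of dL/dw_j."""
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += h
    w_minus[j] -= h
    return (loss(x, y, v, w_plus) - loss(x, y, v, w_minus)) / (2 * h)

rng = np.random.default_rng(1)
m, p = 3, 4
x, y = rng.normal(size=m), 0.7
v, w = rng.normal(size=(m + 1, p)), rng.normal(size=p + 1)

xb = np.concatenate(([1.0], x))
u = np.concatenate(([1.0], np.tanh(xb @ v)))
z = u @ w
print((z - y) * u[1])                 # analytic dL/dw_1 = (z - y) * u_1
print(numeric_dL_dw(x, y, v, w, 1))   # should agree closely
```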

Some further notes on ANN training


• ANN is a flexible model (recall the universal approximation theorem), but the flip side of this flexibility is over-parameterisation, and hence a tendency to overfitting
• Starting weights are usually small random values
distributed around zero
• Implicit regularisation: early stopping
∗ With some activation functions, this shrinks the ANN towards a linear model (why?)

[Plot by Geek3, Wikimedia Commons (CC3)]


Explicit regularisation
• Alternatively, explicit regularisation can be used, much like in ridge regression

• Instead of minimising the loss $L$, minimise the regularised function $L + \lambda\left(\sum_{i=0}^{m}\sum_{j=1}^{p} v_{ij}^2 + \sum_{j=0}^{p} w_j^2\right)$

• This will simply add $2\lambda v_{ij}$ and $2\lambda w_j$ terms to the partial derivatives
• With some activation functions this also shrinks the ANN
towards a linear model
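In code, this is a one-line change to the gradients (my illustration; `grad_v` and `grad_w` are assumed to be the unregularised backpropagation gradients in the layout used in the earlier sketches):

```python
import numpy as np

def regularised_gradients(grad_v, grad_w, v, w, lam):
    """Add the ridge-style penalty gradients 2*lam*v and 2*lam*w
    to the unregularised backpropagation gradients."""
    return grad_v + 2.0 * lam * v, grad_w + 2.0 * lam * w

# Tiny usage example with made-up arrays
v, w = np.ones((3, 2)), np.ones(3)
gv, gw = np.zeros_like(v), np.zeros_like(w)
print(regularised_gradients(gv, gw, v, w, lam=0.1))
```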


This lecture
• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries

• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation

