
Intro to Deep Neural Networks:

MLP and Backpropagation


CS771: Introduction to Machine Learning
Piyush Rai
2
Limitation of Linear Models
▪ Linear models: Output produced by taking a linear combination of input features, 𝑦 = 𝑓(𝒘⊤𝒙), where 𝑓 is some linear/nonlinear function (e.g., identity, sign, or sigmoid)
  ▪ Examples: linear regression, logistic regression, SVM, etc.
▪ A basic unit of the form 𝑦 = 𝑓(𝒘⊤𝒙) is known as the “Perceptron” (not to be confused with the Perceptron “algorithm”, which learns a linear classification model)
▪ However, this model can’t learn nonlinear functions or nonlinear decision boundaries (although it can be kernelized to make it nonlinear)
3
Neural Networks: Multi-layer Perceptron (MLP)
▪ An MLP is a network containing several Perceptron units across many layers
▪ An MLP consists of an input layer, an output layer, and one or more hidden layers
▪ Example (figure): an MLP with an input layer (D = 3 visible units), one hidden layer (K = 2 hidden units), and an output layer producing a scalar-valued output ŷ𝑛
  ▪ Input layer units/nodes denote the original features of the input 𝒙𝑛
  ▪ Hidden layer units/nodes (a.k.a. “neurons”) act as new features; the weights connecting the layers are learnable
  ▪ Can think of this model as a combination of two predictions ℎ𝑛1 and ℎ𝑛2 of two simple functions (the red and blue weights in the figure)
  ▪ The effective 𝒙-to-𝑦 mapping is nonlinear (will see the justification shortly)
▪ An MLP is also called a feedforward fully-connected network
4
Illustration: Neural Net with Single Hidden Layer
▪ Compute 𝐾 pre-activations for each input 𝒙𝑛, each via a linear model with learnable weight vector 𝒘𝑘
    𝑧𝑛𝑘 = 𝒘𝑘⊤𝒙𝑛 = ∑_{𝑑=1}^{𝐷} 𝑤𝑑𝑘 𝑥𝑛𝑑   (𝑘 = 1,2, … , 𝐾)
▪ Apply a nonlinear activation on each pre-activation to get the hidden units
    ℎ𝑛𝑘 = 𝑔(𝑧𝑛𝑘)   (𝑘 = 1,2, … , 𝐾)
  ▪ The hidden layer activation function 𝑔 must be nonlinear
▪ Apply a linear model with learnable weight vector 𝒗, with 𝒉𝑛 acting as features, to get the score of the input
    𝑠𝑛 = 𝒗⊤𝒉𝑛 = ∑_{𝑘=1}^{𝐾} 𝑣𝑘 ℎ𝑛𝑘
▪ Finally, the output is produced by converting the score into the actual prediction
    ŷ𝑛 = 𝑜(𝑠𝑛)
  ▪ The last layer activation function 𝑜 can be an identity function too (e.g., for regression, ŷ𝑛 = 𝑠𝑛), or sigmoid/softmax/sign, etc.
▪ Loss: ℒ(𝑾, 𝒗) = ∑_{𝑛=1}^{𝑁} ℓ(𝑦𝑛, ŷ𝑛)
  (a small code sketch of this forward computation follows below)
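Not part of the slides, but as a concrete illustration, here is a minimal numpy sketch of the forward computation above. It assumes tanh as the hidden activation 𝑔 and identity as the output activation 𝑜; all variable names and numbers are made up for the example.

```python
import numpy as np

def forward(x_n, W, v, g=np.tanh, o=lambda s: s):
    """One-hidden-layer MLP forward pass for a single input x_n.

    W : (D, K) matrix whose columns are the weight vectors w_k
    v : (K,) output-layer weight vector
    g : hidden-layer nonlinearity (tanh here, just as an example)
    o : last-layer activation (identity here, e.g. for regression)
    """
    z_n = W.T @ x_n          # K pre-activations, z_nk = w_k^T x_n
    h_n = g(z_n)             # K hidden units, h_nk = g(z_nk)
    s_n = v @ h_n            # score, s_n = v^T h_n
    return o(s_n)            # prediction y_hat_n = o(s_n)

# Example with D = 3 inputs and K = 2 hidden units (as in the figure)
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
v = rng.standard_normal(2)
x = np.array([1.0, -0.5, 2.0])
print(forward(x, W, v))
```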
5
Neural Nets: A Compact Illustration
▪ In the compact illustration, a single node denotes a linear combination of inputs followed by a nonlinear operation on the result
▪ Note: The hidden layer pre-activation 𝑧𝑛𝑘 and post-activation ℎ𝑛𝑘 will be shown together for brevity
  ▪ More succinctly, we combine the pre-activation and post-activation and directly show only ℎ𝑛𝑘 = 𝑔(𝒘𝑘⊤𝒙𝑛) to denote the single value computed by a hidden node
  ▪ Likewise, the final output is shown directly
▪ Denoting 𝑾 = [𝒘1, 𝒘2, … , 𝒘𝐾], with 𝒘𝑘 ∈ ℝ𝐷, we have 𝒉𝑛 = 𝑔(𝑾⊤𝒙𝑛) ∈ ℝ𝐾 (𝐾 = 2, 𝐷 = 3 above). Note: 𝑔 is applied elementwise on the pre-activation vector 𝒛𝑛 = 𝑾⊤𝒙𝑛
6
Activation Functions: Some Common Choices
▪ sigmoid and tanh
  ▪ For sigmoid as well as tanh, gradients saturate (become close to zero as the function tends to its extreme values)
  ▪ tanh is preferred over sigmoid: it helps keep the mean of the next layer’s inputs close to zero (with sigmoid, it is close to 0.5)
▪ ReLU and Leaky ReLU are among the most popular choices (and are also efficient to compute)
  ▪ Leaky ReLU helps fix the dead neuron problem of ReLU when the pre-activation 𝑎 is a negative number
▪ Most activation functions are monotonic, but there exist some non-monotonic activation functions as well (e.g., Swish: 𝑎 × 𝜎(𝛽𝑎))
▪ Important: Without a nonlinear activation, a deep neural network is equivalent to a linear model no matter how many layers we use, e.g., 𝑦 = 𝒗⊤𝑔(𝑾⊤𝒙) reduces to 𝒗⊤𝑾⊤𝒙, which is still linear, if 𝑔 is not nonlinear
  (sketches of these activations follow below)
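For concreteness, here is a small numpy sketch of these common activations (not from the slides); the leaky-ReLU slope 0.01 and the Swish 𝛽 = 1 defaults are just illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, slope=0.01):       # small slope for a < 0 avoids "dead" neurons
    return np.where(a > 0, a, slope * a)

def swish(a, beta=1.0):              # non-monotonic: a * sigmoid(beta * a)
    return a * sigmoid(beta * a)
```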
7
MLP Can Learn Any Nonlinear Function
▪ An MLP can be seen as a composition of multiple linear models combined nonlinearly
▪ Example (figure): a nonlinear classification problem where the positive class lies in the middle region, flanked by the negative class on either side
  ▪ Standard single “Perceptron” classifier (no hidden units): the score monotonically increases in one direction. Such a one-sided increase is not ideal for learning nonlinear decision boundaries; it is just a linear model
  ▪ Multi-layer Perceptron classifier (one hidden layer with 2 units): capable of learning nonlinear boundaries. Composing the two one-sided increasing score functions (using learnable weights 𝑣1 and 𝑣2 to “add” them) gives a high score in the middle and a low score on either of the two sides of it, which is exactly what we want for the given classification problem
▪ A single hidden layer MLP with a sufficiently large number of hidden units can approximate any function (Hornik, 1991)
8
Superposition of two linear models = Nonlinear model

▪ Two sigmoids (blue and orange in the figure) can be combined via a shift and a subtraction operation to result in a nonlinear separation boundary for 𝑝(𝑦 = red|𝑥) (see the sketch below)
▪ Likewise, more than two sigmoids can be combined to learn even more sophisticated separation boundaries
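A quick numerical sketch of this superposition idea (illustrative, not from the slides): subtracting two shifted sigmoids yields a bump-shaped score that is high only in the middle region, which no single sigmoid (i.e., no linear model) can produce. The shifts of ±2 are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-6, 6, 13)

# Two shifted sigmoids; their difference is high only in the middle region.
bump = sigmoid(x + 2) - sigmoid(x - 2)
print(np.round(bump, 2))
```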
9

Examples of some basic NN/MLP architectures

10
Single Hidden Layer and Single Output
▪ One hidden layer with 𝐾 nodes and a single output (e.g., scalar-valued regression or binary classification)
    ŷ𝑛 = 𝑜(𝒗⊤𝒉𝑛)
  ▪ 𝑔 is a nonlinear activation function
  ▪ 𝑜 is a suitable function (e.g., identity, sign, or sigmoid, etc.) to convert the score into the actual response
11
Single Hidden Layer and Multiple Outputs
▪ One hidden layer with 𝐾 nodes and a vector of 𝐶 outputs ŷ𝑛1, ŷ𝑛2, … , ŷ𝑛𝐶 (e.g., vector-valued regression or multi-class classification or multi-label classification)
    𝒚̂𝑛 = 𝑜(𝑽⊤𝒉𝑛)
  ▪ 𝑔 is a nonlinear activation function
  ▪ 𝑜 is a suitable function (e.g., identity, sign, or softmax, etc.) to convert the score into the actual response vector for 𝒙𝑛
12
Multiple Hidden Layers (One/Multiple Outputs)
▪ Most general case: multiple hidden layers (with the same or different number of hidden nodes in each) and a scalar or vector-valued output (a code sketch of the general forward pass follows below)
▪ Each hidden layer uses a nonlinear activation function 𝑔 (essential, otherwise the network can’t learn nonlinear functions and reduces to a linear model)
▪ Note: The nonlinearity 𝑔 is applied elementwise on its inputs, so 𝒉𝑛^(ℓ) has the same size as the vector 𝑾^(ℓ)⊤𝒉𝑛^(ℓ−1)
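A minimal sketch of this general forward pass (not from the slides), assuming a list of weight matrices and an elementwise tanh nonlinearity; bias terms, covered on the next slide, are omitted, and all names and sizes are illustrative.

```python
import numpy as np

def mlp_forward(x_n, Ws, V, g=np.tanh, o=lambda s: s):
    """Forward pass through L hidden layers.

    Ws : list of weight matrices, Ws[l] has shape (K_{l-1}, K_l), with K_0 = D
    V  : (K_L, C) output-layer weights (C = 1 gives a scalar output)
    """
    h = x_n
    for W in Ws:                 # h^(l) = g(W^(l)^T h^(l-1)), applied elementwise
        h = g(W.T @ h)
    return o(V.T @ h)            # y_hat_n = o(V^T h^(L))

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]   # two hidden layers
V = rng.standard_normal((2, 1))                                   # single output (C = 1)
print(mlp_forward(np.array([1.0, -0.5, 2.0]), Ws, V))
```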
13
The Bias Term
▪ Each layer’s pre-activations 𝒛𝑛^(ℓ) also have an additive bias term 𝒃^(ℓ) (of the same size as 𝒛𝑛^(ℓ) and 𝒉𝑛^(ℓ))
    𝒛𝑛^(ℓ) = 𝑾^(ℓ)⊤𝒉𝑛^(ℓ−1) + 𝒃^(ℓ)
    𝒉𝑛^(ℓ) = 𝑔(𝒛𝑛^(ℓ))
▪ The bias term increases the expressiveness of the network and ensures that we have nonzero activations/pre-activations even if this layer’s input is a vector of all zeros
▪ Note that the bias term is the same for all inputs (it does not depend on 𝑛)
▪ The bias term 𝒃^(ℓ) is also learnable (a small sketch of the per-layer computation follows below)
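Continuing the earlier sketch, the per-layer computation with the bias term might look like the following; the shapes and values here are made up purely for illustration.

```python
import numpy as np

g = np.tanh
rng = np.random.default_rng(0)
h_prev = rng.standard_normal(3)        # h^(l-1), with K_{l-1} = 3
W = rng.standard_normal((3, 2))        # W^(l), shape K_{l-1} x K_l
b = np.zeros(2)                        # learnable bias b^(l), same for every input n
z = W.T @ h_prev + b                   # z^(l) = W^(l)^T h^(l-1) + b^(l)
h = g(z)                               # h^(l) = g(z^(l)), same size as z
print(z, h)
```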


14
Neural Nets are Feature Learners
▪ Hidden layers can be seen as learning a feature representation 𝜙(𝒙𝑛) for each input 𝒙𝑛
  ▪ The last (𝐿th) hidden layer’s nodes’ values 𝒉𝑛^(𝐿) play the role of 𝜙(𝒙𝑛)
15
Kernel Methods vs Neural Nets
▪ Recall the prediction rule for a kernel method (e.g., kernel SVM)
    𝑦𝑛 = 𝒘⊤𝜙(𝒙𝑛)   OR   𝑦𝑛 = ∑_{𝑖=1}^{𝑁} 𝛼𝑖 𝑘(𝒙𝑖, 𝒙𝑛)
▪ It’s like a one hidden layer NN with
  ▪ Pre-defined 𝑁 features {𝑘(𝒙𝑖, 𝒙𝑛)}_{𝑖=1}^{𝑁} acting as the feature vector 𝒉𝑛
  ▪ {𝛼𝑖}_{𝑖=1}^{𝑁} acting as learnable output layer weights (see the sketch below)
▪ It’s also like a one hidden layer NN with
  ▪ Pre-defined 𝑀 features 𝜙(𝒙𝑛) (𝑀 being the size of the feature mapping 𝜙) acting as the feature vector 𝒉𝑛
  ▪ 𝒘 ∈ ℝ𝑀 acting as learnable output layer weights
▪ Both kernel methods and deep neural networks extract new features from the inputs
  ▪ For kernel methods, the features are pre-defined via the kernel function
  ▪ For deep NNs, the features are learned by the network
▪ Note: Kernels can also be learned from data (“kernel learning”), in which case kernel methods and deep neural nets become even more similar in spirit ☺
▪ Also note that neural nets are faster than kernel methods at test time, since kernel methods need to store the training examples at test time whereas neural nets do not
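A tiny sketch of the first analogy above (illustrative only): the kernel values against the stored training inputs act like a fixed hidden layer, with the 𝛼’s as learnable output-layer weights. The RBF kernel, its bandwidth, and the random data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((5, 3))        # N = 5 stored training inputs
alpha = rng.standard_normal(5)               # learnable "output layer" weights

def rbf(a, b, gamma=1.0):                    # pre-defined kernel function
    return np.exp(-gamma * np.sum((a - b) ** 2))

def predict(x_new):
    h = np.array([rbf(x_i, x_new) for x_i in X_train])   # fixed "hidden layer" features
    return alpha @ h                                      # linear output layer

print(predict(rng.standard_normal(3)))
```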
16
Features Learned by a Neural Network
▪ 𝑾^(ℓ) = [𝒘1^(ℓ), 𝒘2^(ℓ), … , 𝒘𝐾ℓ^(ℓ)] is the 𝐾ℓ−1 × 𝐾ℓ matrix of weights between layer ℓ − 1 and layer ℓ (with 𝐾0 = 𝐷)
▪ 𝒘𝑘^(ℓ) ∈ ℝ^{𝐾ℓ−1} denotes the “feature detector” for feature 𝑘 of layer ℓ; 𝑾^(ℓ) collectively represents the 𝐾ℓ feature detectors for layer ℓ
▪ For input 𝒙𝑛, ℎ𝑛𝑘^(ℓ) = 𝑔(𝒘𝑘^(ℓ)⊤𝒉𝑛^(ℓ−1)) is the value of feature 𝑘 in layer ℓ
▪ All the incoming weights (a vector) to a hidden node can be seen as representing a pattern/feature-detector of the previous layer’s inputs (e.g., in the figure, 𝒘32^(2) and 𝒘100^(1) each denote a feature detector)
17
Why Neural Networks Work Better: Another View
▪ Linear models tend to only learn the “average” pattern
  ▪ E.g., the weight vector of a linear classification model represents the average pattern of a class
▪ Deep models can learn multiple patterns (each hidden node can learn one pattern)
  ▪ Thus deep models can capture more subtle variations than a simpler linear model can

18
Neural Nets: Some Aspects
▪ Much of the magic lies in the hidden layers
  ▪ Hidden layers learn and detect good features
▪ Need to consider a few aspects
  ▪ Number of hidden layers, number of units in each hidden layer
    ▪ Choosing the right NN architecture is important and a research area in itself; Neural Architecture Search (NAS) is an automated technique to do this
  ▪ Why bother with many hidden layers rather than use a single very wide hidden layer (recall Hornik’s universal function approximator theorem)?
  ▪ Complex networks (several, very wide hidden layers) or simpler networks (few, moderately wide hidden layers)?
  ▪ Aren’t deep neural networks prone to overfitting (since they contain a huge number of parameters)?
19
Representational Power of Neural Nets
▪ Consider a single hidden layer neural net with 𝐾 hidden nodes
  ▪ [Figure: learned decision boundaries for K = 3, K = 6, and K = 20 hidden units]
▪ Recall that each hidden unit “adds” a simple function to the overall function
▪ Increasing 𝐾 (the number of hidden units) will result in a more complex function
▪ Very large 𝐾 seems to overfit (see the figure above). Should we instead prefer small 𝐾?
▪ No! It is better to use large 𝐾 and regularize well. Reason/justification:
  ▪ A simple NN with small 𝐾 will have a few local optima, some of which may be bad
  ▪ A complex NN with large 𝐾 will have many local optima, all equally good (there are theoretical results on this)
▪ We can also use multiple hidden layers (each sufficiently large) and regularize well
20
Wide or Deep?
▪ While a very wide single hidden layer can approximate any function, we often prefer many, less wide, hidden layers
▪ Higher layers help learn more directly useful/interpretable features (also useful for compressing data using a small number of features)
21

Training Deep Neural Networks:


The Backpropagation Algorithm

22
Background: Gradient and Jacobian
▪ Let 𝒚 = 𝑓(𝒙), where 𝑓: ℝ𝑃 → ℝ𝑄, 𝒙 ∈ ℝ𝑃, 𝒚 ∈ ℝ𝑄. Denote 𝒚 = [𝑓1(𝒙), … , 𝑓𝑄(𝒙)]
▪ The gradient of each component 𝑦𝑖 = 𝑓𝑖(𝒙) ∈ ℝ (𝑖 = 1,2, … , 𝑄) w.r.t. 𝒙 ∈ ℝ𝑃 is
    ∇𝑓𝑖(𝒙) = 𝜕𝑦𝑖/𝜕𝒙 = [𝜕𝑦𝑖/𝜕𝑥1, … , 𝜕𝑦𝑖/𝜕𝑥𝑃] ∈ ℝ^{1×𝑃}
  ▪ Note: The gradient is expressed here as a row vector (it has the same length as 𝒙, which is a column vector) for notational convenience later
▪ Likewise, the gradient of the whole vector 𝒚 ∈ ℝ𝑄 w.r.t. the vector 𝒙 ∈ ℝ𝑃 can be defined using the 𝑄 × 𝑃 Jacobian matrix 𝐽𝑓, whose rows consist of the above gradients
    𝜕𝒚/𝜕𝒙 = 𝐽𝑓 ∈ ℝ^{𝑄×𝑃}, with entries 𝐽𝑓𝑖𝑗 = 𝜕𝑦𝑖/𝜕𝑥𝑗, i.e., row 𝑖 of 𝐽𝑓 contains the gradient vector ∇𝑓𝑖(𝒙) of 𝑦𝑖 = 𝑓𝑖(𝒙) w.r.t. 𝒙
  ▪ Likewise, for a function 𝑓: ℝ^{𝑃×𝑄} → ℝ^{𝑅×𝑆}, the Jacobian 𝐽𝑓 ∈ ℝ^{𝑅×𝑆×𝑃×𝑄} (a 4D tensor). More generally, for a function 𝑓: ℝ^{𝐼1×𝐼2×⋯} → ℝ^{𝑂1×𝑂2×⋯}, its Jacobian 𝐽𝑓 ∈ ℝ^{𝑂1×𝑂2×⋯×𝐼1×𝐼2×⋯} (a tensor)
  (a small numerical sketch follows below)
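As a sanity check on these definitions, here is a small numerical sketch (not from the slides) that builds the 𝑄 × 𝑃 Jacobian of a toy map 𝑓: ℝ³ → ℝ² by finite differences; the toy map and the step size 1e-6 are arbitrary.

```python
import numpy as np

def f(x):                                   # toy map f: R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian_fd(func, x, eps=1e-6):
    """Finite-difference Jacobian: row i is the gradient of f_i(x) w.r.t. x."""
    y = func(x)
    J = np.zeros((y.size, x.size))          # Q x P
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (func(x_step) - y) / eps
    return J

x = np.array([1.0, 2.0, 0.5])
print(jacobian_fd(f, x))                    # approx [[2, 1, 0], [0, 0, cos(0.5)]]
```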
23
Background: Multivariate Chain Rule of Calculus
▪ Let 𝒙 ∈ ℝ𝑃, 𝒚 = 𝑔(𝒙) ∈ ℝ𝑄, 𝑧 = 𝑓(𝒚) ∈ ℝ, where 𝑔: ℝ𝑃 → ℝ𝑄, 𝑓: ℝ𝑄 → ℝ (i.e., 𝒙 → 𝒚 → 𝑧)
    𝜕𝑧/𝜕𝒙 = ∑_{𝑖=1}^{𝑄} (𝜕𝑧/𝜕𝑦𝑖)(𝜕𝑦𝑖/𝜕𝒙)   (a 1 × 𝑃 vector derivative)
  ▪ This uses the chain rule of total derivatives; the sum is needed since 𝑧 depends on 𝒚, and 𝒚 is a vector
▪ The above can be written as a product of a vector and a matrix: the 1 × 𝑄 gradient of 𝑓 (same as its Jacobian, since 𝑓 is scalar-valued) times the 𝑄 × 𝑃 Jacobian of 𝑔
    𝜕𝑧/𝜕𝒙 = ∇𝑓(𝒚) × 𝐽𝑔 ∈ ℝ^{1×𝑃}
  ▪ So the result is a product of the Jacobians of 𝑓 and 𝑔, in that order ☺ (a numerical check follows below)
▪ More generally, let 𝒘 ∈ ℝ𝑃, 𝒙 = ℎ(𝒘) ∈ ℝ𝑄, 𝒚 = 𝑔(𝒙) ∈ ℝ𝑅, 𝒛 = 𝑓(𝒚) ∈ ℝ𝑆. Then
    𝜕𝒛/𝜕𝒘 = 𝐽𝑓 × 𝐽𝑔 × 𝐽ℎ ∈ ℝ^{𝑆×𝑃}
  ▪ i.e., the product of the 3 Jacobians, in that order (simple! ☺)
  ▪ Note that the chain rule for scalar variables 𝑤, 𝑥, 𝑦, 𝑧 is defined in a similar way: 𝜕𝑧/𝜕𝑤 = 𝑓′(𝑦)𝑔′(𝑥)ℎ′(𝑤)
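A quick numerical check of the vector chain rule (illustrative, not from the slides): for toy maps 𝑔: ℝ² → ℝ³ and 𝑓: ℝ³ → ℝ, the product ∇𝑓(𝒚) × 𝐽𝑔 matches the directly computed derivative of the composition. The maps and step size are arbitrary.

```python
import numpy as np

def g_map(x):                     # g: R^2 -> R^3
    return np.array([x[0] ** 2, x[0] * x[1], np.exp(x[1])])

def f_scalar(y):                  # f: R^3 -> R (scalar-valued)
    return y[0] + 2 * y[1] + 3 * y[2]

def fd_jacobian(func, v, eps=1e-6):
    """Finite-difference Jacobian (rows = output components, cols = inputs)."""
    out = np.atleast_1d(func(v))
    J = np.zeros((out.size, v.size))
    for j in range(v.size):
        v2 = v.copy()
        v2[j] += eps
        J[:, j] = (np.atleast_1d(func(v2)) - out) / eps
    return J

x = np.array([1.0, 0.5])
grad_f = fd_jacobian(f_scalar, g_map(x))            # 1 x 3 gradient of f at y = g(x)
J_g = fd_jacobian(g_map, x)                         # 3 x 2 Jacobian of g at x
print(grad_f @ J_g)                                 # chain rule: a 1 x 2 row vector
print(fd_jacobian(lambda v: f_scalar(g_map(v)), x)) # direct derivative (should match)
```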
24
Backpropagation (Backprop)
▪ Backprop is gradient descent with the multivariate chain rule for derivatives
▪ Consider a two hidden layer neural network, with 𝒉𝑛^(1) = 𝑔(𝒛𝑛^(1)), 𝒉𝑛^(2) = 𝑔(𝒛𝑛^(2)), the output ŷ𝑛 computed from 𝒉𝑛^(2) using output weights 𝒗, and loss
    ℒ(𝑾^(1), 𝑾^(2), 𝒗) = ∑_{𝑛=1}^{𝑁} ℓ(𝑦𝑛, ŷ𝑛) = ∑_{𝑛=1}^{𝑁} ℓ𝑛
▪ We wish to minimize the loss. The gradient based updates will be
    𝒗 = 𝒗 − 𝜂 𝜕ℒ/𝜕𝒗,   𝑾^(𝑖) = 𝑾^(𝑖) − 𝜂 𝜕ℒ/𝜕𝑾^(𝑖)   (𝑖 = 1,2)
▪ Since ℒ = ∑_{𝑛=1}^{𝑁} ℓ𝑛, we need to compute 𝜕ℓ𝑛/𝜕𝒗 and 𝜕ℓ𝑛/𝜕𝑾^(𝑖) (𝑖 = 1,2)
▪ Assume the output activation 𝑜 is identity (so ŷ𝑛 = 𝒗⊤𝒉𝑛^(2)). Then, with ℓ′(𝑦𝑛, ŷ𝑛) denoting the derivative of ℓ𝑛 w.r.t. ŷ𝑛,
    𝜕ℓ𝑛/𝜕𝒗 = (𝜕ℓ𝑛/𝜕ŷ𝑛)(𝜕ŷ𝑛/𝜕𝒗) = ℓ′(𝑦𝑛, ŷ𝑛) 𝒉𝑛^(2)
25
Backpropagation in detail
▪ Let’s now look at 𝜕ℓ𝑛/𝜕𝑾^(2), where ℓ𝑛 = ℓ(𝑦𝑛, ŷ𝑛) and ŷ𝑛 = 𝒗⊤𝒉𝑛^(2)
    𝜕ℓ𝑛/𝜕𝑾^(2) = (𝜕ℓ𝑛/𝜕ŷ𝑛)(𝜕ŷ𝑛/𝜕𝑾^(2)) = ℓ′(𝑦𝑛, ŷ𝑛) 𝜕ŷ𝑛/𝜕𝑾^(2)
▪ Applying the chain rule to ŷ𝑛, which depends on 𝑾^(2) through 𝒉𝑛^(2),
    𝜕ŷ𝑛/𝜕𝑾^(2) = (𝜕ŷ𝑛/𝜕𝒉𝑛^(2))(𝜕𝒉𝑛^(2)/𝜕𝑾^(2)) + (𝜕ŷ𝑛/𝜕𝒗)(𝜕𝒗/𝜕𝑾^(2))
  ▪ Since 𝒗 doesn’t depend on 𝑾^(2), 𝜕𝒗/𝜕𝑾^(2) = 0, so
    𝜕ℓ𝑛/𝜕𝑾^(2) = ℓ′(𝑦𝑛, ŷ𝑛) (𝜕ŷ𝑛/𝜕𝒉𝑛^(2))(𝜕𝒉𝑛^(2)/𝜕𝑾^(2)) = ℓ′(𝑦𝑛, ŷ𝑛) 𝒗⊤ (𝜕𝒉𝑛^(2)/𝜕𝑾^(2))
  ▪ Here 𝜕ŷ𝑛/𝜕𝒉𝑛^(2) = 𝒗⊤ (using the transpose since we assume the gradient to be a row vector), and the result 𝜕ℓ𝑛/𝜕𝑾^(2) is a Jacobian of size 1 × 𝐾2 × 𝐾1
▪ We now need 𝜕𝒉𝑛^(2)/𝜕𝑾^(2) (a Jacobian of size 𝐾2 × 𝐾2 × 𝐾1). Using 𝒉𝑛^(2) = 𝑔(𝒛𝑛^(2)), where 𝒛𝑛^(2) = 𝑾^(2)⊤𝒉𝑛^(1) and 𝑔 is an elementwise applied nonlinearity on the vector 𝒛𝑛^(2),
    𝜕𝒉𝑛^(2)/𝜕𝑾^(2) = (𝜕𝒉𝑛^(2)/𝜕𝒛𝑛^(2))(𝜕𝒛𝑛^(2)/𝜕𝑾^(2)) = diag(𝑔′(𝑧𝑛1^(2)), … , 𝑔′(𝑧𝑛𝐾2^(2))) × 𝜕𝒛𝑛^(2)/𝜕𝑾^(2)
  ▪ 𝜕𝒉𝑛^(2)/𝜕𝒛𝑛^(2) is a diagonal matrix of size 𝐾2 × 𝐾2 with the gradient vector of 𝑔 along the diagonal, and 𝜕𝒛𝑛^(2)/𝜕𝑾^(2) is a Jacobian (tensor) of size 𝐾2 × 𝐾2 × 𝐾1
26
Backpropagation: Computation Reuse
▪ Summarizing, the required gradients/Jacobians for this network are
    𝜕ℓ𝑛/𝜕𝒗 = ℓ′(𝑦𝑛, ŷ𝑛) 𝒉𝑛^(2)
    𝜕ℓ𝑛/𝜕𝑾^(2) = ℓ′(𝑦𝑛, ŷ𝑛) (𝜕ŷ𝑛/𝜕𝒉𝑛^(2)) (𝜕𝒉𝑛^(2)/𝜕𝒛𝑛^(2)) (𝜕𝒛𝑛^(2)/𝜕𝑾^(2))
    𝜕ℓ𝑛/𝜕𝑾^(1) = ℓ′(𝑦𝑛, ŷ𝑛) (𝜕ŷ𝑛/𝜕𝒉𝑛^(2)) (𝜕𝒉𝑛^(2)/𝜕𝒛𝑛^(2)) (𝜕𝒛𝑛^(2)/𝜕𝒉𝑛^(1)) (𝜕𝒉𝑛^(1)/𝜕𝒛𝑛^(1)) (𝜕𝒛𝑛^(1)/𝜕𝑾^(1))
▪ Thus gradient computations done in the upper layers can be stored and reused when computing the gradients in the lower layers (libraries like TensorFlow and PyTorch do so efficiently; a code sketch follows below)
▪ Vanishing gradients: 𝜕𝒉𝑛^(𝑖)/𝜕𝒛𝑛^(𝑖) = diag(𝑔′(𝑧𝑛1^(𝑖)), … , 𝑔′(𝑧𝑛𝐾𝑖^(𝑖))), and the gradients in the lower layers contain a product of many such terms. If 𝑔′ is small (e.g., the gradient of sigmoid or tanh), the gradient becomes vanishingly small for the lower layers and becomes an issue (thus ReLU and other non-saturating activations are preferred)
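To make the reuse concrete, here is a minimal numpy sketch of backprop for a single-hidden-layer network with tanh hidden activation, identity output, and squared loss (these choices and all names are assumptions for illustration; the slides’ network has two hidden layers). Note how the upstream term ℓ′(𝑦𝑛, ŷ𝑛) is computed once and reused for both gradients.

```python
import numpy as np

def backprop_single_example(x, y, W, v, g=np.tanh,
                            g_prime=lambda z: 1 - np.tanh(z) ** 2):
    # Forward pass (store intermediate values for reuse in the backward pass)
    z = W.T @ x              # pre-activations
    h = g(z)                 # hidden units
    y_hat = v @ h            # identity output activation
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: dl/dy_hat is computed once and reused for all gradients
    dl_dyhat = y_hat - y                       # l'(y_n, y_hat_n) for squared loss
    grad_v = dl_dyhat * h                      # dl/dv = l' * h
    delta = dl_dyhat * v * g_prime(z)          # reused "upstream" term for layer 1
    grad_W = np.outer(x, delta)                # dl/dW, same shape as W (D x K)
    return loss, grad_v, grad_W

rng = np.random.default_rng(0)
W, v = rng.standard_normal((3, 2)), rng.standard_normal(2)
loss, gv, gW = backprop_single_example(np.array([1.0, -0.5, 2.0]), 1.0, W, v)
print(loss, gv, gW, sep="\n")
```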
27
Backpropagation
▪ Backprop iterates between a forward pass and a backward pass
  ▪ The forward pass computes the hidden units and the loss using the current values of the network weights 𝑾 and 𝒗
  ▪ The backward pass uses the loss and the hidden units’ values to compute the gradient of the loss, starting with the weights 𝒗 in the last layer; it updates the weights, and proceeds to do the same for the other layers
▪ Software frameworks such as TensorFlow and PyTorch support this already, so you don’t need to implement it by hand (so no worries about computing derivatives, etc.); a small example follows below
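For example, a minimal PyTorch sketch of this forward/backward loop might look like the following; the architecture, random data, and learning rate are arbitrary illustrative choices.

```python
import torch

# Toy data: N = 4 inputs with D = 3 features, scalar regression targets
X = torch.randn(4, 3)
y = torch.randn(4, 1)

model = torch.nn.Sequential(          # one hidden layer with K = 2 tanh units
    torch.nn.Linear(3, 2),
    torch.nn.Tanh(),
    torch.nn.Linear(2, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    loss = ((model(X) - y) ** 2).mean()   # forward pass: predictions and loss
    opt.zero_grad()
    loss.backward()                       # backward pass: gradients via backprop
    opt.step()                            # gradient descent update
```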
28
Backpropagation through an example

