Vision
Jochen Lang
jlang@uottawa.ca
• Multi-layer perceptron
• Feed forward networks
• Activation functions
• Loss function
• Training by back propagation
– Loop over the training points $(x_i, y_i) \in \mathcal{M}$, evaluate the gradient after each training point, and update the parameter estimate (stochastic gradient descent)
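A minimal sketch of this per-sample update loop, assuming a hypothetical gradient function `grad_loss(params, x, y)`, NumPy-array parameters, and a learning rate `lr` (names and values are illustrative, not from the slides):

```python
import random
import numpy as np

def sgd(params, data, grad_loss, lr=0.01, epochs=10):
    """Stochastic gradient descent: update after every single training point."""
    for _ in range(epochs):
        random.shuffle(data)              # visit the training points in random order
        for x, y in data:
            g = grad_loss(params, x, y)   # gradient of the loss at this one point
            params = params - lr * g      # update the parameter estimate
    return params

# Hypothetical usage: params and the gradient are NumPy arrays
# params = sgd(np.zeros(4), training_points, grad_loss)
```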
[Figure: feed-forward network diagram with an input layer ($X_1, X_2$), a hidden layer ($Z_1, Z_2, Z_3$), and an output layer using a logit or softmax activation]
Matrix Notation
• In practice, NNs are conveniently expressed as matrix-vector multiplications
• Our network example
– $T = \beta_0 + \boldsymbol{\beta}^{\top}\boldsymbol{Z}$ is a matrix equation (here $\boldsymbol{\beta}$ is a vector because we have a single output)
$$T = \beta_0 + \begin{bmatrix}\beta_1 & \beta_2 & \beta_3\end{bmatrix}\begin{bmatrix}z_1 \\ z_2 \\ z_3\end{bmatrix} = \begin{bmatrix}\beta_0 & \beta_1 & \beta_2 & \beta_3\end{bmatrix}\begin{bmatrix}1 \\ z_1 \\ z_2 \\ z_3\end{bmatrix}$$
– $\boldsymbol{Z} = \mathfrak{a}(\boldsymbol{\alpha}_0 + \boldsymbol{\alpha}\boldsymbol{X})$ with (here) $M = 3$ leads to a matrix equation; $\mathfrak{a}$ is an element-wise activation function
$$\begin{bmatrix}z_1 \\ z_2 \\ z_3\end{bmatrix} = \mathfrak{a}\!\left(\begin{bmatrix}\alpha_{10} & \alpha_{11} & \alpha_{12} \\ \alpha_{20} & \alpha_{21} & \alpha_{22} \\ \alpha_{30} & \alpha_{31} & \alpha_{32}\end{bmatrix}\begin{bmatrix}1 \\ x_1 \\ x_2\end{bmatrix}\right)$$
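A minimal NumPy sketch of this forward pass for the two-input, three-hidden-unit, single-output example; the sigmoid activation and the concrete weight values are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hidden-layer weights: one row per hidden unit, columns = [bias, x1, x2]
alpha = np.array([[0.1,  0.4, -0.2],
                  [0.0,  0.3,  0.5],
                  [0.2, -0.1,  0.6]])
# Output weights: [bias, z1, z2, z3]
beta = np.array([0.05, 0.7, -0.3, 0.2])

x = np.array([1.0, 2.0])                         # example input (x1, x2)
z = sigmoid(alpha @ np.concatenate(([1.0], x)))  # Z = a(alpha [1; x])
T = beta @ np.concatenate(([1.0], z))            # T = beta^T [1; z]
print(T)
```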
• Approach
– Separate the input at a node from the output after application of the non-linearity
– and hence obtain the overall derivative of the loss with respect to each weight as a product of per-node factors (chain rule)
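In symbols, a sketch of this split (the pre-activation name $t_m$ is an addition to the slide's notation):
$$t_m = \alpha_{m0} + \boldsymbol{\alpha}_m^{\top}\boldsymbol{x}, \qquad z_m = \mathfrak{a}(t_m), \qquad \frac{\partial z_m}{\partial \alpha_{ml}} = \mathfrak{a}'(t_m)\, x_l$$
so that, by the chain rule, the derivative of the loss with respect to any weight is a product of such per-node factors along the path from that weight to the output.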
Partial Differentials
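A sketch of these partial derivatives for the single-output example, assuming a squared-error loss $R = \tfrac{1}{2}(y - T)^2$ and an identity output unit (both illustrative assumptions; the network diagram also allows a logit or softmax output):
$$\frac{\partial R}{\partial \beta_m} = -(y - T)\, z_m, \qquad \frac{\partial R}{\partial \alpha_{ml}} = -(y - T)\, \beta_m\, \mathfrak{a}'(t_m)\, x_l$$
with $z_0 = 1$ and $x_0 = 1$ accounting for the bias weights $\beta_0$ and $\alpha_{m0}$.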
Backward Pass: Calculate the Updates
– Calculate the partial derivatives for all weights
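A minimal NumPy sketch of this backward pass for the same 2-3-1 example network; the squared-error loss, sigmoid hidden activation, and identity output unit are the same illustrative assumptions as in the sketches above:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backward_pass(alpha, beta, x, y):
    """Partial derivatives of R = 0.5*(y - T)^2 with respect to all weights."""
    x1 = np.concatenate(([1.0], x))   # input with bias entry
    t = alpha @ x1                    # hidden pre-activations
    z = sigmoid(t)                    # hidden outputs
    z1 = np.concatenate(([1.0], z))   # hidden outputs with bias entry
    T = beta @ z1                     # network output (identity output unit)

    dR_dT = -(y - T)                  # dR/dT
    d_beta = dR_dT * z1               # dR/dbeta_m = -(y - T) * z_m
    dz = dR_dT * beta[1:]             # back through the output weights
    dt = dz * z * (1.0 - z)           # through the sigmoid: a'(t) = z (1 - z)
    d_alpha = np.outer(dt, x1)        # dR/dalpha_ml = dt_m * x_l
    return d_alpha, d_beta
```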
Apply the Updates to the Weights
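A sketch of the standard gradient-descent update, assuming a learning rate $\gamma$ (the symbol is illustrative):
$$\beta_m \leftarrow \beta_m - \gamma\,\frac{\partial R}{\partial \beta_m}, \qquad \alpha_{ml} \leftarrow \alpha_{ml} - \gamma\,\frac{\partial R}{\partial \alpha_{ml}}$$
In stochastic gradient descent these updates are applied after every training point, as in the loop sketched earlier.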
Symbolic Differentiation Example
• Assume we want to find the partial derivative $\partial f(x, y)/\partial x$ of the example expression
– We know (and can code) simple rules, e.g., the sum rule, the product rule, the partial derivative of a variable with respect to itself, etc. (in total there are not many)
– We find the derivative for each node and expand the graph (in this example) using the product and sum rules, as in the sketch below
[Figure: expression graph of the example with constant nodes, the variables x and y, multiply nodes, and an add node]
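A minimal sketch of such a rule-based differentiator over an expression tree, coding only the constant, variable, sum, and product rules; the concrete expression 3·x·x + 2·y·y is an illustrative assumption standing in for the slide's example:

```python
# Tiny rule-based symbolic differentiator on expression trees.
# An expression is a number, a variable name (str), or a tuple
# ('+', a, b) / ('*', a, b).

def diff(expr, var):
    if isinstance(expr, (int, float)):      # constant rule: d c / d var = 0
        return 0
    if isinstance(expr, str):               # variable rule: d x / d x = 1, else 0
        return 1 if expr == var else 0
    op, a, b = expr
    if op == '+':                            # sum rule
        return ('+', diff(a, var), diff(b, var))
    if op == '*':                            # product rule
        return ('+', ('*', diff(a, var), b), ('*', a, diff(b, var)))
    raise ValueError(f"unknown operator {op!r}")

# Illustrative expression 3*x*x + 2*y*y (an assumption, not the slide's exact example)
f = ('+', ('*', 3, ('*', 'x', 'x')), ('*', 2, ('*', 'y', 'y')))
print(diff(f, 'x'))
# ('+', ('+', ('*', 0, ('*', 'x', 'x')), ('*', 3, ('+', ('*', 1, 'x'), ('*', 'x', 1)))),
#       ('+', ('*', 0, ('*', 'y', 'y')), ('*', 2, ('+', ('*', 0, 'y'), ('*', 'y', 0)))))
```

The printed result is the fully expanded, unsimplified derivative graph, which illustrates how lengthy the symbolic expression becomes before simplification.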
Example Result
[Figure: expanded derivative graph produced by the sum and product rules, containing constant nodes 0, 1, 2, 3, the variables x and y, and many multiply and add nodes]
Simplify
• The expression and its graph are correct but can obviously be simplified; automatic simplification is a bit harder.
• Symbolic differentiation can lead to lengthy expressions,
even if simplification is successful.
• Not all network nodes are nicely differentiable (e.g., an activation layer with ReLU)
• Example again, evaluated at a concrete input point $(x, y)$
• Result of the forward pass shown in blue in the graph
• Setup for reverse mode
[Figure: the example expression graph annotated with the forward-pass value at each node]
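A minimal sketch of reverse mode on such a graph, again using the illustrative expression 3·x·x + 2·y·y and arbitrary input values: the forward pass records every node and its value on a tape, and the backward pass sweeps the tape in reverse, accumulating adjoints from the output back to x and y:

```python
tape = []  # nodes in creation order (a valid topological order of the graph)

class Node:
    def __init__(self, value, inputs=()):
        self.value = value      # forward-pass value at this node
        self.inputs = inputs    # (input_node, local_derivative) pairs
        self.adjoint = 0.0      # df/d(this node), filled by the backward pass
        tape.append(self)

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def backward(output):
    """Reverse sweep over the tape: propagate adjoints to every input node."""
    output.adjoint = 1.0
    for node in reversed(tape):
        for inp, local in node.inputs:
            inp.adjoint += node.adjoint * local

# Forward pass for the illustrative expression f(x, y) = 3*x*x + 2*y*y
x, y = Node(2.0), Node(3.0)   # the input values are arbitrary examples
f = add(mul(Node(3.0), mul(x, x)), mul(Node(2.0), mul(y, y)))

backward(f)
print(f.value)                # 30.0
print(x.adjoint, y.adjoint)   # 12.0 12.0  (df/dx = 6x, df/dy = 4y)
```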
Differentials – Part I
Differentials – Part II
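The per-node differentials combine by summing over all consumers of a node; in the usual adjoint (bar) notation, which is an addition to the slide's notation, the reverse sweep computes
$$\bar{v}_i = \frac{\partial f}{\partial v_i} = \sum_{j \,:\, v_i \in \mathrm{inputs}(v_j)} \bar{v}_j \, \frac{\partial v_j}{\partial v_i}$$
which is exactly what the backward loop in the reverse-mode sketch above accumulates.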
Observation for AutoDiff
• Back propagation is reverse-mode automatic differentiation applied to the network's computation graph: one forward pass stores the node values, one backward pass yields the partial derivatives for all weights
Summary
• Multi-layer perceptron
• Feed forward networks
• Activation functions
• Loss function
• Training by back propagation