
Lecture 7

Math Foundations Team


Introduction

▶ In the last lecture, we discussed differentiation of univariate
  functions, partial differentiation, gradients, and gradients of
  vector-valued functions.
▶ Now we will look into gradients of matrices and some useful
  identities for computing gradients.
▶ Finally, we will discuss backpropagation and automatic
  differentiation.
Gradients of Matrices

The gradient of an $m \times n$ matrix $A$ with respect to a $p \times q$ matrix $B$ is an $(m \times n) \times (p \times q)$ object, i.e., a four-dimensional tensor $J$ whose entries are given by

\[
J_{ijkl} = \frac{\partial A_{ij}}{\partial B_{kl}}.
\]

Since we can identify $\mathbb{R}^{m \times n}$ with $\mathbb{R}^{mn}$, we can reshape the two matrices into vectors of length $mn$ and $pq$, respectively. The gradient using these vectorized matrices results in a Jacobian of size $mn \times pq$.
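As a quick illustration of this reshaping idea, here is a minimal sketch (assuming NumPy; the dimensions are chosen purely for illustration) comparing the four-dimensional tensor view with the flattened $mn \times pq$ Jacobian.

```python
import numpy as np

m, n, p, q = 2, 3, 4, 5

# Four-dimensional Jacobian tensor J with entries J[i, j, k, l] = dA_ij / dB_kl.
J_tensor = np.zeros((m, n, p, q))

# Identifying R^(m x n) with R^(mn), the same information can be stored
# as a flat Jacobian of shape (mn, pq).
J_flat = J_tensor.reshape(m * n, p * q)

print(J_tensor.shape)  # (2, 3, 4, 5)
print(J_flat.shape)    # (6, 20)
```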
Gradients of Matrices

Let $f = Ax$ where $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^{n}$. Then

\[
\frac{\partial f}{\partial A} \in \mathbb{R}^{m \times (m \times n)}.
\]

By definition,

\[
\frac{\partial f}{\partial A} =
\begin{bmatrix}
\frac{\partial f_1}{\partial A} \\
\vdots \\
\frac{\partial f_m}{\partial A}
\end{bmatrix},
\qquad
\frac{\partial f_i}{\partial A} \in \mathbb{R}^{1 \times (m \times n)}.
\]
Gradients of Matrices

Now, we have
\[
f_i = \sum_{j=1}^{n} A_{ij} x_j, \qquad i = 1, \dots, m.
\]

Therefore, by taking the partial derivative with respect to $A_{iq}$,

\[
\frac{\partial f_i}{\partial A_{iq}} = x_q.
\]

Hence, the $i$-th row becomes

\[
\frac{\partial f_i}{\partial A_{i,:}} = x^{\top} \in \mathbb{R}^{1 \times 1 \times n},
\qquad
\frac{\partial f_i}{\partial A_{k,:}} = 0^{\top} \in \mathbb{R}^{1 \times 1 \times n} \quad \text{for } k \neq i.
\]

Hence, by stacking the partial derivatives over the rows of $A$, we get

\[
\frac{\partial f_i}{\partial A} =
\begin{bmatrix}
0^{\top} \\
\vdots \\
x^{\top} \\
\vdots \\
0^{\top}
\end{bmatrix}
\in \mathbb{R}^{1 \times m \times n},
\]

where $x^{\top}$ appears in the $i$-th row.
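A minimal numerical sketch of this result (assuming NumPy; the sizes are arbitrary): each slice $\partial f_i / \partial A$ is a matrix that is zero everywhere except for $x^{\top}$ in its $i$-th row, which agrees with a finite-difference check.

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Analytic gradient: dF[i] is the (m x n) matrix df_i/dA,
# which has x^T in row i and zeros elsewhere.
dF = np.zeros((m, m, n))
for i in range(m):
    dF[i, i, :] = x

# Finite-difference check of df_i/dA_kl.
eps = 1e-6
dF_num = np.zeros_like(dF)
for k in range(m):
    for l in range(n):
        A_pert = A.copy()
        A_pert[k, l] += eps
        dF_num[:, k, l] = (A_pert @ x - A @ x) / eps

print(np.allclose(dF, dF_num, atol=1e-4))  # True
```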
Gradients of Matrices with respect to Matrices

Let $B \in \mathbb{R}^{m \times n}$ and $f : \mathbb{R}^{m \times n} \to \mathbb{R}^{n \times n}$ with

\[
f(B) = B^{\top} B =: K \in \mathbb{R}^{n \times n}.
\]

Then, we have

\[
\frac{\partial K}{\partial B} \in \mathbb{R}^{(n \times n) \times (m \times n)}.
\]

Moreover,

\[
\frac{\partial K_{pq}}{\partial B} \in \mathbb{R}^{1 \times (m \times n)}, \qquad \text{for } p, q = 1, \dots, n,
\]

where $K_{pq}$ is the $(p, q)$-th entry of $K = f(B)$.
Gradients of Matrices with respect to Matrices

Let the $p$-th column of $B$ be $b_p$. Then

\[
K_{pq} = b_p^{\top} b_q = \sum_{l=1}^{m} B_{lp} B_{lq}.
\]

Computing the partial derivative, we get

\[
\frac{\partial K_{pq}}{\partial B_{ij}} = \sum_{l=1}^{m} \frac{\partial}{\partial B_{ij}} B_{lp} B_{lq} = \partial_{pqij}.
\]
Gradients of Matrices with respect to Matrices

Clearly, we have

\[
\partial_{pqij} =
\begin{cases}
B_{iq} & \text{if } j = p,\; p \neq q, \\
B_{ip} & \text{if } j = q,\; p \neq q, \\
2 B_{iq} & \text{if } j = p,\; p = q, \\
0 & \text{otherwise},
\end{cases}
\]

where $p, q, j = 1, \dots, n$ and $i = 1, \dots, m$.
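This case analysis can be checked numerically. The sketch below (assuming NumPy; the dimensions are arbitrary) builds $\partial_{pqij}$ from the four cases and compares it against a finite-difference derivative of $K_{pq} = (B^{\top} B)_{pq}$.

```python
import numpy as np

m, n = 4, 3
rng = np.random.default_rng(1)
B = rng.standard_normal((m, n))

def d_pqij(B, p, q, i, j):
    # The four cases for the partial derivative dK_pq / dB_ij.
    if j == p and p != q:
        return B[i, q]
    if j == q and p != q:
        return B[i, p]
    if j == p and p == q:
        return 2 * B[i, q]
    return 0.0

eps = 1e-6
ok = True
for p in range(n):
    for q in range(n):
        for i in range(m):
            for j in range(n):
                B_pert = B.copy()
                B_pert[i, j] += eps
                num = ((B_pert.T @ B_pert)[p, q] - (B.T @ B)[p, q]) / eps
                ok &= abs(num - d_pqij(B, p, q, i, j)) < 1e-4
print(ok)  # True
```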
Useful Identities for Computing Gradients

▶ $\dfrac{\partial}{\partial X} f(X)^{\top} = \left( \dfrac{\partial f(X)}{\partial X} \right)^{\top}$
▶ $\dfrac{\partial}{\partial X} \operatorname{tr}(f(X)) = \operatorname{tr}\left( \dfrac{\partial f(X)}{\partial X} \right)$
▶ $\dfrac{\partial}{\partial X} \det(f(X)) = \det(f(X)) \operatorname{tr}\left( f(X)^{-1} \dfrac{\partial f(X)}{\partial X} \right)$
▶ $\dfrac{\partial}{\partial X} f(X)^{-1} = -f(X)^{-1} \dfrac{\partial f(X)}{\partial X} f(X)^{-1}$
Useful Identities for Computing Gradients

▶ $\dfrac{\partial a^{\top} X^{-1} b}{\partial X} = -(X^{-1})^{\top} a b^{\top} (X^{-1})^{\top}$
▶ $\dfrac{\partial x^{\top} a}{\partial x} = a^{\top}$
▶ $\dfrac{\partial a^{\top} x}{\partial x} = a^{\top}$
▶ $\dfrac{\partial a^{\top} X b}{\partial X} = a b^{\top}$
▶ $\dfrac{\partial x^{\top} B x}{\partial x} = x^{\top} (B + B^{\top})$
▶ $\dfrac{\partial}{\partial s} (x - As)^{\top} W (x - As) = -2 (x - As)^{\top} W A$ for symmetric $W$.
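As an example, the quadratic-form identity can be verified numerically. The sketch below (assuming NumPy; sizes and the test point are arbitrary) compares $x^{\top}(B + B^{\top})$ with a finite-difference gradient of $x^{\top} B x$.

```python
import numpy as np

n = 4
rng = np.random.default_rng(2)
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# Analytic gradient of the quadratic form x^T B x with respect to x.
grad_analytic = x @ (B + B.T)

# Central finite differences.
eps = 1e-6
grad_num = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    grad_num[i] = ((x + e) @ B @ (x + e) - (x - e) @ B @ (x - e)) / (2 * eps)

print(np.allclose(grad_analytic, grad_num, atol=1e-4))  # True
```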
Backpropagation and Automatic Differentiation

Consider the function

\[
f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\!\left(x^2 + \exp(x^2)\right).
\]

Taking derivatives,

\[
\frac{df}{dx} = \frac{2x + 2x \exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\!\left(x^2 + \exp(x^2)\right)\left(2x + 2x \exp(x^2)\right)
\]
\[
= 2x \left( \frac{1}{2\sqrt{x^2 + \exp(x^2)}} - \sin\!\left(x^2 + \exp(x^2)\right) \right) \left(1 + \exp(x^2)\right).
\]
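A quick numerical sanity check of this closed-form derivative (a sketch assuming NumPy; the test point is arbitrary):

```python
import numpy as np

def f(x):
    return np.sqrt(x**2 + np.exp(x**2)) + np.cos(x**2 + np.exp(x**2))

def df(x):
    # Hand-derived derivative from the slide.
    u = x**2 + np.exp(x**2)
    return 2 * x * (1 / (2 * np.sqrt(u)) - np.sin(u)) * (1 + np.exp(x**2))

x0, eps = 0.7, 1e-6
numerical = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
print(np.isclose(df(x0), numerical, atol=1e-5))  # True
```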
Motivation

▶ The implementation of the gradient can be significantly more
  expensive than computing the function itself, which imposes
  unnecessary overhead when we obtain such lengthy expressions.
▶ We need an efficient way to compute the gradient of an error
  function with respect to the parameters of the model.
▶ For training deep neural network models, the backpropagation
  algorithm is one such method.
Backpropagation and Automatic Differentiation

In neural networks with multiple layers, layer $i$ computes

\[
f_i(x_{i-1}) = \sigma(A_{i-1} x_{i-1} + b_{i-1}),
\]

where $x_{i-1}$ is the output of layer $i-1$ and $\sigma$ is an activation function.
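A minimal forward-pass sketch of such a multi-layer network (assuming NumPy and tanh as the activation; the layer sizes and names are illustrative, not part of the lecture):

```python
import numpy as np

def forward(x, params, sigma=np.tanh):
    """Compute x_K from x_0 = x through layers f_i(x) = sigma(A x + b)."""
    for A, b in params:
        x = sigma(A @ x + b)
    return x

rng = np.random.default_rng(3)
sizes = [4, 8, 8, 2]  # illustrative layer widths
params = [(rng.standard_normal((sizes[i + 1], sizes[i])),
           rng.standard_normal(sizes[i + 1]))
          for i in range(len(sizes) - 1)]

x = rng.standard_normal(sizes[0])
print(forward(x, params).shape)  # (2,)
```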
Backpropagation

To train these models, the gradient of the loss function $L$ with respect to all model parameters $\theta_j = \{A_j, b_j\}$, $j = 0, \dots, K-1$, and the inputs of each layer needs to be computed. Consider

\[
f_0 := x, \qquad
f_i := \sigma_i(A_{i-1} f_{i-1} + b_{i-1}), \quad i = 1, \dots, K.
\]

We have to find $\theta_j = \{A_j, b_j\}$, $j = 0, \dots, K-1$, such that

\[
L(\theta) = \|y - f_K(\theta, x)\|^2
\]

is minimized, where $\theta = \{A_0, b_0, \dots, A_{K-1}, b_{K-1}\}$.


Backpropagation

Using the chain rule, we get

\[
\frac{\partial L}{\partial \theta_{K-1}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial \theta_{K-1}}
\]
\[
\frac{\partial L}{\partial \theta_{K-2}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial \theta_{K-2}}
\]
\[
\frac{\partial L}{\partial \theta_{K-3}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial f_{K-2}} \frac{\partial f_{K-2}}{\partial \theta_{K-3}}
\]
\[
\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial \theta_i}
\]
Backpropagation

If the partial derivatives $\partial L / \partial \theta_{i+1}$ are computed, then the computation can be reused to compute $\partial L / \partial \theta_i$.
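A sketch of how this reuse looks in code (assuming NumPy, the squared loss from the previous slide, and tanh activations; variable names are illustrative): the running vector `delta` holds $\partial L / \partial f_i$ and is reused for every earlier layer.

```python
import numpy as np

def forward_backward(x, y, params, sigma=np.tanh):
    # Forward pass, storing pre-activations z_i and outputs f_i.
    fs, zs = [x], []
    for A, b in params:
        zs.append(A @ fs[-1] + b)
        fs.append(sigma(zs[-1]))
    loss = np.sum((y - fs[-1]) ** 2)

    # Backward pass: delta holds dL/df_i and is reused layer by layer.
    grads = []
    delta = -2 * (y - fs[-1])               # dL/df_K
    for (A, b), z, f_prev in zip(reversed(params), reversed(zs), reversed(fs[:-1])):
        dz = delta * (1 - np.tanh(z) ** 2)  # dL/dz_i for the tanh activation
        grads.append((np.outer(dz, f_prev), dz))  # dL/dA_{i-1}, dL/db_{i-1}
        delta = A.T @ dz                    # dL/df_{i-1}, reused in the next step
    return loss, list(reversed(grads))

# Usage with small random data (shapes are illustrative).
rng = np.random.default_rng(4)
sizes = [3, 5, 2]
params = [(rng.standard_normal((sizes[i + 1], sizes[i])),
           rng.standard_normal(sizes[i + 1]))
          for i in range(len(sizes) - 1)]
x, y = rng.standard_normal(3), rng.standard_normal(2)
loss, grads = forward_backward(x, y, params)
print(loss, [g[0].shape for g in grads])
```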
Example

Consider the function

\[
f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\!\left(x^2 + \exp(x^2)\right).
\]

Let

\[
a = x^2, \quad b = \exp(a), \quad c = a + b, \quad d = \sqrt{c}, \quad e = \cos(c) \;\Rightarrow\; f = d + e.
\]
Example

\[
\frac{\partial a}{\partial x} = 2x, \qquad
\frac{\partial b}{\partial a} = \exp(a), \qquad
\frac{\partial c}{\partial a} = 1 = \frac{\partial c}{\partial b},
\]
\[
\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}}, \qquad
\frac{\partial e}{\partial c} = -\sin(c), \qquad
\frac{\partial f}{\partial d} = 1 = \frac{\partial f}{\partial e}.
\]
Example

Thus, we have

\[
\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} + \frac{\partial f}{\partial e} \frac{\partial e}{\partial c}
\]
\[
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b}
\]
\[
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial a}
\]
\[
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x}
\]
Example

Substituting the results, we get

\[
\frac{\partial f}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c))
\]
\[
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \cdot 1
\]
\[
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \exp(a) + \frac{\partial f}{\partial c} \cdot 1
\]
\[
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \, 2x
\]

Thus, the computation required for calculating the derivative is of similar complexity as the computation of the function itself.
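The worked example translates almost line-for-line into code. The following sketch (assuming plain Python with NumPy; variable names mirror the slide) computes the forward pass, then the reverse pass, and checks the result against the closed-form derivative from the earlier slide.

```python
import numpy as np

def f_and_grad(x):
    # Forward pass: intermediate variables from the slide.
    a = x ** 2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e

    # Reverse pass: accumulate df/d(variable) back to the input.
    df_dd, df_de = 1.0, 1.0
    df_dc = df_dd * (1 / (2 * np.sqrt(c))) + df_de * (-np.sin(c))
    df_db = df_dc * 1.0
    df_da = df_db * np.exp(a) + df_dc * 1.0
    df_dx = df_da * 2 * x
    return f, df_dx

x0 = 1.3
f_val, g_val = f_and_grad(x0)

# Compare with the closed-form derivative.
u = x0 ** 2 + np.exp(x0 ** 2)
closed_form = 2 * x0 * (1 / (2 * np.sqrt(u)) - np.sin(u)) * (1 + np.exp(x0 ** 2))
print(np.isclose(g_val, closed_form))  # True
```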
Formalization of Automatic Differentiation

Let $x_1, \dots, x_d$ be the input variables,
$x_{d+1}, \dots, x_{D-1}$ the intermediate variables, and
$x_D$ the output variable. Then we have

\[
x_i = g_i(x_{\mathrm{Pa}(x_i)}),
\]

where the $g_i$ are elementary functions, also called forward propagation functions, and $x_{\mathrm{Pa}(x_i)}$ is the set of parent nodes of the variable $x_i$ in the computation graph.
Formalization of Automatic Differentiation

Now,

\[
f = x_D \;\Rightarrow\; \frac{\partial f}{\partial x_D} = 1.
\]

For the other variables, using the chain rule, we get

\[
\frac{\partial f}{\partial x_i}
= \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i}
= \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i}.
\]

The last equation is the backpropagation of the gradient through the computation graph. For neural network training, we backpropagate the error of the prediction with respect to the label.
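A small sketch of this formalization (plain Python; the graph representation and helper names are illustrative assumptions): each node stores its parents and its local derivatives $\partial g_j / \partial x_i$, and the gradient is propagated backwards from $x_D$ using the summation above, applied to the earlier example function.

```python
import math

# Computation graph for f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)),
# stored as (parents, forward function, local partials dg_j/dx_parent).
x0 = 1.3
nodes = {
    "x": ([], lambda: x0, {}),
    "a": (["x"], lambda v: v["x"] ** 2, {"x": lambda v: 2 * v["x"]}),
    "b": (["a"], lambda v: math.exp(v["a"]), {"a": lambda v: math.exp(v["a"])}),
    "c": (["a", "b"], lambda v: v["a"] + v["b"],
          {"a": lambda v: 1.0, "b": lambda v: 1.0}),
    "d": (["c"], lambda v: math.sqrt(v["c"]),
          {"c": lambda v: 0.5 / math.sqrt(v["c"])}),
    "e": (["c"], lambda v: math.cos(v["c"]), {"c": lambda v: -math.sin(v["c"])}),
    "f": (["d", "e"], lambda v: v["d"] + v["e"],
          {"d": lambda v: 1.0, "e": lambda v: 1.0}),
}
order = ["x", "a", "b", "c", "d", "e", "f"]  # topological order

# Forward pass through the graph.
val = {}
for name in order:
    parents, g, _ = nodes[name]
    val[name] = g(val) if parents else g()

# Backward pass: df/dx_i = sum over children x_j of df/dx_j * dg_j/dx_i.
grad = {name: 0.0 for name in order}
grad["f"] = 1.0
for name in reversed(order):
    _, _, partials = nodes[name]
    for parent, dg in partials.items():
        grad[parent] += grad[name] * dg(val)

print(val["f"], grad["x"])
```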
