
Lecture 7

Math Foundations Team


Introduction

▶ In the last lecture, we discussed differentiation of univariate
  functions, partial differentiation, gradients, and gradients of
  vector-valued functions.
▶ Now we will look into gradients of matrices and some useful
  identities for computing gradients.
▶ Finally, we will discuss backpropagation and automatic
  differentiation.
Gradients of Matrices

The gradient of an $m \times n$ matrix $A$ with respect to a $p \times q$ matrix $B$ is an $(m \times n) \times (p \times q)$ object, i.e., a four-dimensional tensor $J$ whose entries are given by

\[
J_{ijkl} = \frac{\partial A_{ij}}{\partial B_{kl}}.
\]

Since we can identify $\mathbb{R}^{m \times n}$ with $\mathbb{R}^{mn}$, we can reshape the two matrices into vectors of length $mn$ and $pq$, respectively. The gradient using these vectorized matrices results in a Jacobian of size $mn \times pq$.
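As a quick illustration of this reshaping idea, here is a minimal sketch (assuming NumPy; the dimensions are chosen purely for illustration) comparing the four-dimensional tensor view with the flattened $mn \times pq$ Jacobian.

```python
import numpy as np

m, n, p, q = 2, 3, 4, 5

# Four-dimensional Jacobian tensor J with entries J[i, j, k, l] = dA_ij / dB_kl.
J_tensor = np.zeros((m, n, p, q))

# Identifying R^(m x n) with R^(mn), the same information can be stored
# as a flat Jacobian of shape (mn, pq).
J_flat = J_tensor.reshape(m * n, p * q)

print(J_tensor.shape)  # (2, 3, 4, 5)
print(J_flat.shape)    # (6, 20)
```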
Gradients of Matrices

Let $f = Ax$ where $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^{n}$. Then

\[
\frac{\partial f}{\partial A} \in \mathbb{R}^{m \times (m \times n)}.
\]

By definition,

\[
\frac{\partial f}{\partial A} =
\begin{bmatrix}
\frac{\partial f_1}{\partial A} \\
\vdots \\
\frac{\partial f_m}{\partial A}
\end{bmatrix},
\qquad
\frac{\partial f_i}{\partial A} \in \mathbb{R}^{1 \times (m \times n)}.
\]
Gradients of Matrices

Now, we have
\[
f_i = \sum_{j=1}^{n} A_{ij} x_j, \qquad i = 1, \dots, m.
\]

Therefore, by taking the partial derivative with respect to $A_{iq}$,

\[
\frac{\partial f_i}{\partial A_{iq}} = x_q.
\]

Hence, the $i$-th row becomes

\[
\frac{\partial f_i}{\partial A_{i,:}} = x^{\top} \in \mathbb{R}^{1 \times 1 \times n},
\qquad
\frac{\partial f_i}{\partial A_{k,:}} = 0^{\top} \in \mathbb{R}^{1 \times 1 \times n} \quad \text{for } k \neq i.
\]

Hence, by stacking the partial derivatives over the rows of $A$, we get

\[
\frac{\partial f_i}{\partial A} =
\begin{bmatrix}
0^{\top} \\
\vdots \\
x^{\top} \\
\vdots \\
0^{\top}
\end{bmatrix}
\in \mathbb{R}^{1 \times m \times n},
\]

where $x^{\top}$ appears in the $i$-th row.
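A minimal numerical sketch of this result (assuming NumPy; the sizes are arbitrary): each slice $\partial f_i / \partial A$ is a matrix that is zero everywhere except for $x^{\top}$ in its $i$-th row, which agrees with a finite-difference check.

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Analytic gradient: dF[i] is the (m x n) matrix df_i/dA,
# which has x^T in row i and zeros elsewhere.
dF = np.zeros((m, m, n))
for i in range(m):
    dF[i, i, :] = x

# Finite-difference check of df_i/dA_kl.
eps = 1e-6
dF_num = np.zeros_like(dF)
for k in range(m):
    for l in range(n):
        A_pert = A.copy()
        A_pert[k, l] += eps
        dF_num[:, k, l] = (A_pert @ x - A @ x) / eps

print(np.allclose(dF, dF_num, atol=1e-4))  # True
```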
Gradients of Matrices with respect to Matrices

Let $B \in \mathbb{R}^{m \times n}$ and $f : \mathbb{R}^{m \times n} \to \mathbb{R}^{n \times n}$ with

\[
f(B) = B^{\top} B =: K \in \mathbb{R}^{n \times n}.
\]

Then, we have

\[
\frac{\partial K}{\partial B} \in \mathbb{R}^{(n \times n) \times (m \times n)}.
\]

Moreover,

\[
\frac{\partial K_{pq}}{\partial B} \in \mathbb{R}^{1 \times (m \times n)}, \qquad \text{for } p, q = 1, \dots, n,
\]

where $K_{pq}$ is the $(p, q)$-th entry of $K = f(B)$.
Gradients of Matrices with respect to Matrices

Let the $p$-th column of $B$ be $b_p$. Then

\[
K_{pq} = b_p^{\top} b_q = \sum_{l=1}^{m} B_{lp} B_{lq}.
\]

Computing the partial derivative, we get

\[
\frac{\partial K_{pq}}{\partial B_{ij}} = \sum_{l=1}^{m} \frac{\partial}{\partial B_{ij}} B_{lp} B_{lq} = \partial_{pqij}.
\]
Gradients of Matrices with respect to Matrices

Clearly, we have

\[
\partial_{pqij} =
\begin{cases}
B_{iq} & \text{if } j = p,\; p \neq q, \\
B_{ip} & \text{if } j = q,\; p \neq q, \\
2 B_{iq} & \text{if } j = p,\; p = q, \\
0 & \text{otherwise},
\end{cases}
\]

where $p, q, j = 1, \dots, n$ and $i = 1, \dots, m$.
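This case analysis can be checked numerically. The sketch below (assuming NumPy; the dimensions are arbitrary) builds $\partial_{pqij}$ from the four cases and compares it against a finite-difference derivative of $K_{pq} = (B^{\top} B)_{pq}$.

```python
import numpy as np

m, n = 4, 3
rng = np.random.default_rng(1)
B = rng.standard_normal((m, n))

def d_pqij(B, p, q, i, j):
    # The four cases for the partial derivative dK_pq / dB_ij.
    if j == p and p != q:
        return B[i, q]
    if j == q and p != q:
        return B[i, p]
    if j == p and p == q:
        return 2 * B[i, q]
    return 0.0

eps = 1e-6
ok = True
for p in range(n):
    for q in range(n):
        for i in range(m):
            for j in range(n):
                B_pert = B.copy()
                B_pert[i, j] += eps
                num = ((B_pert.T @ B_pert)[p, q] - (B.T @ B)[p, q]) / eps
                ok &= abs(num - d_pqij(B, p, q, i, j)) < 1e-4
print(ok)  # True
```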
Useful Identities for Computing Gradients

▶ $\dfrac{\partial}{\partial X} f(X)^{\top} = \left( \dfrac{\partial f(X)}{\partial X} \right)^{\top}$
▶ $\dfrac{\partial}{\partial X} \operatorname{tr}(f(X)) = \operatorname{tr}\left( \dfrac{\partial f(X)}{\partial X} \right)$
▶ $\dfrac{\partial}{\partial X} \det(f(X)) = \det(f(X)) \operatorname{tr}\left( f(X)^{-1} \dfrac{\partial f(X)}{\partial X} \right)$
▶ $\dfrac{\partial}{\partial X} f(X)^{-1} = -f(X)^{-1} \dfrac{\partial f(X)}{\partial X} f(X)^{-1}$
Useful Identities for Computing Gradients

▶ $\dfrac{\partial a^{\top} X^{-1} b}{\partial X} = -(X^{-1})^{\top} a b^{\top} (X^{-1})^{\top}$
▶ $\dfrac{\partial x^{\top} a}{\partial x} = a^{\top}$
▶ $\dfrac{\partial a^{\top} x}{\partial x} = a^{\top}$
▶ $\dfrac{\partial a^{\top} X b}{\partial X} = a b^{\top}$
▶ $\dfrac{\partial x^{\top} B x}{\partial x} = x^{\top} (B + B^{\top})$
▶ $\dfrac{\partial}{\partial s} (x - As)^{\top} W (x - As) = -2 (x - As)^{\top} W A$ for symmetric $W$.
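As an example, the quadratic-form identity can be verified numerically. The sketch below (assuming NumPy; sizes and the test point are arbitrary) compares $x^{\top}(B + B^{\top})$ with a finite-difference gradient of $x^{\top} B x$.

```python
import numpy as np

n = 4
rng = np.random.default_rng(2)
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# Analytic gradient of the quadratic form x^T B x with respect to x.
grad_analytic = x @ (B + B.T)

# Central finite differences.
eps = 1e-6
grad_num = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    grad_num[i] = ((x + e) @ B @ (x + e) - (x - e) @ B @ (x - e)) / (2 * eps)

print(np.allclose(grad_analytic, grad_num, atol=1e-4))  # True
```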
Backpropagation and Automatic Differentiation

Consider the function

\[
f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\!\left(x^2 + \exp(x^2)\right).
\]

Taking derivatives,

\[
\frac{df}{dx} = \frac{2x + 2x \exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\!\left(x^2 + \exp(x^2)\right)\left(2x + 2x \exp(x^2)\right)
\]
\[
= 2x \left( \frac{1}{2\sqrt{x^2 + \exp(x^2)}} - \sin\!\left(x^2 + \exp(x^2)\right) \right) \left(1 + \exp(x^2)\right).
\]
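A quick numerical sanity check of this closed-form derivative (a sketch assuming NumPy; the test point is arbitrary):

```python
import numpy as np

def f(x):
    return np.sqrt(x**2 + np.exp(x**2)) + np.cos(x**2 + np.exp(x**2))

def df(x):
    # Hand-derived derivative from the slide.
    u = x**2 + np.exp(x**2)
    return 2 * x * (1 / (2 * np.sqrt(u)) - np.sin(u)) * (1 + np.exp(x**2))

x0, eps = 0.7, 1e-6
numerical = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
print(np.isclose(df(x0), numerical, atol=1e-5))  # True
```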
Motivation

▶ The implementation of the gradient can be significantly more
  expensive than computing the function itself, which imposes
  unnecessary overhead when we obtain such lengthy expressions.
▶ We need an efficient way to compute the gradient of an error
  function with respect to the parameters of the model.
▶ For training deep neural network models, the backpropagation
  algorithm is one such method.
Backpropagation and Automatic Differentiation

In neural networks with multiple layers, layer $i$ computes

\[
f_i(x_{i-1}) = \sigma(A_{i-1} x_{i-1} + b_{i-1}),
\]

where $x_{i-1}$ is the output of layer $i-1$ and $\sigma$ is an activation function.
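A minimal forward-pass sketch of such a multi-layer network (assuming NumPy and tanh as the activation; the layer sizes and names are illustrative, not part of the lecture):

```python
import numpy as np

def forward(x, params, sigma=np.tanh):
    """Compute x_K from x_0 = x through layers f_i(x) = sigma(A x + b)."""
    for A, b in params:
        x = sigma(A @ x + b)
    return x

rng = np.random.default_rng(3)
sizes = [4, 8, 8, 2]  # illustrative layer widths
params = [(rng.standard_normal((sizes[i + 1], sizes[i])),
           rng.standard_normal(sizes[i + 1]))
          for i in range(len(sizes) - 1)]

x = rng.standard_normal(sizes[0])
print(forward(x, params).shape)  # (2,)
```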
Backpropagation

To train these models, the gradient of the loss function $L$ with respect to all model parameters $\theta_j = \{A_j, b_j\}$, $j = 0, \dots, K-1$, and the inputs of each layer needs to be computed. Consider

\[
f_0 := x, \qquad
f_i := \sigma_i(A_{i-1} f_{i-1} + b_{i-1}), \quad i = 1, \dots, K.
\]

We have to find $\theta_j = \{A_j, b_j\}$, $j = 0, \dots, K-1$, such that

\[
L(\theta) = \|y - f_K(\theta, x)\|^2
\]

is minimized, where $\theta = \{A_0, b_0, \dots, A_{K-1}, b_{K-1}\}$.


Backpropagation

Using the chain rule, we get

\[
\frac{\partial L}{\partial \theta_{K-1}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial \theta_{K-1}}
\]
\[
\frac{\partial L}{\partial \theta_{K-2}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial \theta_{K-2}}
\]
\[
\frac{\partial L}{\partial \theta_{K-3}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial f_{K-2}} \frac{\partial f_{K-2}}{\partial \theta_{K-3}}
\]
\[
\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial \theta_i}
\]
Backpropagation

If the partial derivatives $\partial L / \partial \theta_{i+1}$ are computed, then the computation can be reused to compute $\partial L / \partial \theta_i$.
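A sketch of how this reuse looks in code (assuming NumPy, the squared loss from the previous slide, and tanh activations; variable names are illustrative): the running vector `delta` holds $\partial L / \partial f_i$ and is reused for every earlier layer.

```python
import numpy as np

def forward_backward(x, y, params, sigma=np.tanh):
    # Forward pass, storing pre-activations z_i and outputs f_i.
    fs, zs = [x], []
    for A, b in params:
        zs.append(A @ fs[-1] + b)
        fs.append(sigma(zs[-1]))
    loss = np.sum((y - fs[-1]) ** 2)

    # Backward pass: delta holds dL/df_i and is reused layer by layer.
    grads = []
    delta = -2 * (y - fs[-1])               # dL/df_K
    for (A, b), z, f_prev in zip(reversed(params), reversed(zs), reversed(fs[:-1])):
        dz = delta * (1 - np.tanh(z) ** 2)  # dL/dz_i for the tanh activation
        grads.append((np.outer(dz, f_prev), dz))  # dL/dA_{i-1}, dL/db_{i-1}
        delta = A.T @ dz                    # dL/df_{i-1}, reused in the next step
    return loss, list(reversed(grads))

# Usage with small random data (shapes are illustrative).
rng = np.random.default_rng(4)
sizes = [3, 5, 2]
params = [(rng.standard_normal((sizes[i + 1], sizes[i])),
           rng.standard_normal(sizes[i + 1]))
          for i in range(len(sizes) - 1)]
x, y = rng.standard_normal(3), rng.standard_normal(2)
loss, grads = forward_backward(x, y, params)
print(loss, [g[0].shape for g in grads])
```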
Example

Consider the function

\[
f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\!\left(x^2 + \exp(x^2)\right).
\]

Let

\[
a = x^2, \quad b = \exp(a), \quad c = a + b, \quad d = \sqrt{c}, \quad e = \cos(c) \;\Rightarrow\; f = d + e.
\]
Example

\[
\frac{\partial a}{\partial x} = 2x, \qquad
\frac{\partial b}{\partial a} = \exp(a), \qquad
\frac{\partial c}{\partial a} = 1 = \frac{\partial c}{\partial b},
\]
\[
\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}}, \qquad
\frac{\partial e}{\partial c} = -\sin(c), \qquad
\frac{\partial f}{\partial d} = 1 = \frac{\partial f}{\partial e}.
\]
Example

Thus, we have

\[
\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} + \frac{\partial f}{\partial e} \frac{\partial e}{\partial c}
\]
\[
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b}
\]
\[
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial a}
\]
\[
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x}
\]
Example

Substituting the results, we get

\[
\frac{\partial f}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c))
\]
\[
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \cdot 1
\]
\[
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \exp(a) + \frac{\partial f}{\partial c} \cdot 1
\]
\[
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \, 2x
\]

Thus, the computation required for calculating the derivative is of similar complexity as the computation of the function itself.
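The worked example translates almost line-for-line into code. The following sketch (assuming plain Python with NumPy; variable names mirror the slide) computes the forward pass, then the reverse pass, and checks the result against the closed-form derivative from the earlier slide.

```python
import numpy as np

def f_and_grad(x):
    # Forward pass: intermediate variables from the slide.
    a = x ** 2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e

    # Reverse pass: accumulate df/d(variable) back to the input.
    df_dd, df_de = 1.0, 1.0
    df_dc = df_dd * (1 / (2 * np.sqrt(c))) + df_de * (-np.sin(c))
    df_db = df_dc * 1.0
    df_da = df_db * np.exp(a) + df_dc * 1.0
    df_dx = df_da * 2 * x
    return f, df_dx

x0 = 1.3
f_val, g_val = f_and_grad(x0)

# Compare with the closed-form derivative.
u = x0 ** 2 + np.exp(x0 ** 2)
closed_form = 2 * x0 * (1 / (2 * np.sqrt(u)) - np.sin(u)) * (1 + np.exp(x0 ** 2))
print(np.isclose(g_val, closed_form))  # True
```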
Formalization of Automatic Differentiation

Let $x_1, \dots, x_d$ be the input variables,
$x_{d+1}, \dots, x_{D-1}$ the intermediate variables, and
$x_D$ the output variable. Then we have

\[
x_i = g_i(x_{\mathrm{Pa}(x_i)}),
\]

where the $g_i$ are elementary functions, also called forward propagation functions, and $x_{\mathrm{Pa}(x_i)}$ is the set of parent nodes of the variable $x_i$ in the computation graph.
Formalization of Automatic Differentiation

Now,

\[
f = x_D \;\Rightarrow\; \frac{\partial f}{\partial x_D} = 1.
\]

For the other variables, using the chain rule, we get

\[
\frac{\partial f}{\partial x_i}
= \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i}
= \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i}.
\]

The last equation is the backpropagation of the gradient through the computation graph. For neural network training, we backpropagate the error of the prediction with respect to the label.
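A small sketch of this formalization (plain Python; the graph representation and helper names are illustrative assumptions): each node stores its parents and its local derivatives $\partial g_j / \partial x_i$, and the gradient is propagated backwards from $x_D$ using the summation above, applied to the earlier example function.

```python
import math

# Computation graph for f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)),
# stored as (parents, forward function, local partials dg_j/dx_parent).
x0 = 1.3
nodes = {
    "x": ([], lambda: x0, {}),
    "a": (["x"], lambda v: v["x"] ** 2, {"x": lambda v: 2 * v["x"]}),
    "b": (["a"], lambda v: math.exp(v["a"]), {"a": lambda v: math.exp(v["a"])}),
    "c": (["a", "b"], lambda v: v["a"] + v["b"],
          {"a": lambda v: 1.0, "b": lambda v: 1.0}),
    "d": (["c"], lambda v: math.sqrt(v["c"]),
          {"c": lambda v: 0.5 / math.sqrt(v["c"])}),
    "e": (["c"], lambda v: math.cos(v["c"]), {"c": lambda v: -math.sin(v["c"])}),
    "f": (["d", "e"], lambda v: v["d"] + v["e"],
          {"d": lambda v: 1.0, "e": lambda v: 1.0}),
}
order = ["x", "a", "b", "c", "d", "e", "f"]  # topological order

# Forward pass through the graph.
val = {}
for name in order:
    parents, g, _ = nodes[name]
    val[name] = g(val) if parents else g()

# Backward pass: df/dx_i = sum over children x_j of df/dx_j * dg_j/dx_i.
grad = {name: 0.0 for name in order}
grad["f"] = 1.0
for name in reversed(order):
    _, _, partials = nodes[name]
    for parent, dg in partials.items():
        grad[parent] += grad[name] * dg(val)

print(val["f"], grad["x"])
```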
