Matrix Differentiation Rules and Application

Matrix Differentiation
CS5240 Theoretical Foundations in Multimedia
Leow Wee Kheng
Department of Computer Science

School of Computing
National University of Singapore
Leow Wee Kheng (NUS) Matrix Differentiation 1 / 36

Linear Fitting Revisited

Linear fitting solves this problem:
Given n data points $\mathbf{p}_i = [x_{i1} \cdots x_{im}]^\top$, $1 \le i \le n$, and their
corresponding values $v_i$, find a linear function f that
minimizes the error
\[ E = \sum_{i=1}^{n} \big( f(\mathbf{p}_i) - v_i \big)^2. \tag{1} \]
The linear function $f(\mathbf{p})$ has the form
\[ f(\mathbf{p}) = f(x_1, \ldots, x_m) = a_1 x_1 + \cdots + a_m x_m + a_{m+1}. \tag{2} \]

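The data points are organized into a matrix equation
\[ \mathbf{D}\mathbf{a} = \mathbf{v}, \tag{3} \]
where
\[
\mathbf{D} = \begin{bmatrix} x_{11} & \cdots & x_{1m} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{n1} & \cdots & x_{nm} & 1 \end{bmatrix}, \quad
\mathbf{a} = \begin{bmatrix} a_1 \\ \vdots \\ a_m \\ a_{m+1} \end{bmatrix}, \quad \text{and} \quad
\mathbf{v} = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}. \tag{4}
\]
The least-squares solution of Eq. 3 is
\[ \mathbf{a} = (\mathbf{D}^\top \mathbf{D})^{-1} \mathbf{D}^\top \mathbf{v}. \tag{5} \]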
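As a minimal sketch of Eq. 5 in code, the snippet below builds D from a small synthetic data set and solves the normal equations. The data, the coefficient values and all variable names are assumptions made for illustration; `np.linalg.solve` is used instead of forming the inverse explicitly, which is a common numerical choice rather than something the slides prescribe.

```python
import numpy as np

# Hypothetical data: n = 5 points in m = 2 dimensions, generated from
# known coefficients so the fit can be checked.
rng = np.random.default_rng(0)
P = rng.normal(size=(5, 2))              # rows are the data points p_i
a_true = np.array([2.0, -1.0, 0.5])      # [a_1, a_2, a_3]; a_3 is the offset
D = np.hstack([P, np.ones((5, 1))])      # append the column of ones (Eq. 4)
v = D @ a_true                           # noise-free values v_i

# Eq. 5: a = (D^T D)^{-1} D^T v, solved without forming the inverse.
a_fit = np.linalg.solve(D.T @ D, D.T @ v)

# Cross-check against NumPy's least-squares solver.
a_lstsq, *_ = np.linalg.lstsq(D, v, rcond=None)
```

Because the values are noise-free here, `a_fit` should reproduce `a_true` up to rounding, and it should agree with `a_lstsq`.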

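Denote each row of $\mathbf{D}$ as $\mathbf{d}_i^\top$. Then,
\[ E = \sum_{i=1}^{n} (\mathbf{d}_i^\top \mathbf{a} - v_i)^2 = \|\mathbf{D}\mathbf{a} - \mathbf{v}\|^2. \tag{6} \]
So, the linear least squares problem can be described very compactly as
\[ \min_{\mathbf{a}} \|\mathbf{D}\mathbf{a} - \mathbf{v}\|^2. \tag{7} \]
To show that the solution in Eq. 5 minimizes the error E, we need to
differentiate E with respect to $\mathbf{a}$ and set the derivative to zero:
\[ \frac{dE}{d\mathbf{a}} = \mathbf{0}. \tag{8} \]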
How to do this differentiation?

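The obvious (but hard) way:
\[ E = \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} a_j x_{ij} + a_{m+1} - v_i \Big)^2. \tag{9} \]
Differentiating explicitly with respect to each coefficient $a_k$ gives
\[
\frac{\partial E}{\partial a_k} =
\begin{cases}
2 \displaystyle\sum_{i=1}^{n} \Big( \sum_{j=1}^{m} a_j x_{ij} + a_{m+1} - v_i \Big) x_{ik}, & \text{for } k \neq m+1, \\[2ex]
2 \displaystyle\sum_{i=1}^{n} \Big( \sum_{j=1}^{m} a_j x_{ij} + a_{m+1} - v_i \Big), & \text{for } k = m+1.
\end{cases}
\]
Then, set $\partial E/\partial a_k = 0$ and solve for $a_k$.

This is slow, tedious and error-prone!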
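To make the contrast concrete, here is a small sketch that evaluates the partial derivatives one coefficient at a time and compares them with the compact matrix expression obtained in the proof at the end of these notes. The function names and the random test data are assumptions made for illustration only.

```python
import numpy as np

def grad_elementwise(X, v, a):
    """Partial derivatives of E, one coefficient at a time (the 'hard way')."""
    n, m = X.shape
    g = np.zeros(m + 1)
    for i in range(n):
        # residual of point i: sum_j a_j x_ij + a_{m+1} - v_i
        r = sum(a[j] * X[i, j] for j in range(m)) + a[m] - v[i]
        for k in range(m):
            g[k] += 2.0 * r * X[i, k]    # case k != m+1
        g[m] += 2.0 * r                  # case k = m+1
    return g

def grad_matrix(X, v, a):
    """Same derivative in matrix form, 2 D^T (D a - v), as a 1-D array."""
    D = np.hstack([X, np.ones((X.shape[0], 1))])
    return 2.0 * D.T @ (D @ a - v)

# Hypothetical test data.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
v = rng.normal(size=6)
a = rng.normal(size=4)
```

Both functions should return the same four numbers; the matrix version is one line, the loop version is where index mistakes tend to creep in.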


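Matrix Derivatives
There are 6 common types of matrix derivatives. Columns give the type of the
quantity being differentiated; rows give the type of the variable.

Type      Scalar                              Vector                                       Matrix
Scalar    $\partial y/\partial x$             $\partial \mathbf{y}/\partial x$             $\partial \mathbf{Y}/\partial x$
Vector    $\partial y/\partial \mathbf{x}$    $\partial \mathbf{y}/\partial \mathbf{x}$
Matrix    $\partial y/\partial \mathbf{X}$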

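Derivatives by Scalar

Numerator layout notation:
\[
\frac{\partial y}{\partial x}, \qquad
\frac{\partial \mathbf{y}}{\partial x} =
\begin{bmatrix} \dfrac{\partial y_1}{\partial x} \\ \vdots \\ \dfrac{\partial y_m}{\partial x} \end{bmatrix}, \qquad
\frac{\partial \mathbf{Y}}{\partial x} =
\begin{bmatrix}
\dfrac{\partial y_{11}}{\partial x} & \cdots & \dfrac{\partial y_{1n}}{\partial x} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_{m1}}{\partial x} & \cdots & \dfrac{\partial y_{mn}}{\partial x}
\end{bmatrix}.
\]

Denominator layout notation:
\[
\frac{\partial y}{\partial x}, \qquad
\frac{\partial \mathbf{y}}{\partial x} =
\begin{bmatrix} \dfrac{\partial y_1}{\partial x} & \cdots & \dfrac{\partial y_m}{\partial x} \end{bmatrix}
\equiv \frac{\partial \mathbf{y}^\top}{\partial x}.
\]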

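Derivatives by Vector

Numerator layout notation:
\[
\frac{\partial y}{\partial \mathbf{x}} =
\begin{bmatrix} \dfrac{\partial y}{\partial x_1} & \cdots & \dfrac{\partial y}{\partial x_n} \end{bmatrix}, \qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_m}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{bmatrix}
\equiv \frac{\partial \mathbf{y}}{\partial \mathbf{x}^\top}.
\]

Denominator layout notation:
\[
\frac{\partial y}{\partial \mathbf{x}} =
\begin{bmatrix} \dfrac{\partial y}{\partial x_1} \\ \vdots \\ \dfrac{\partial y}{\partial x_n} \end{bmatrix}, \qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_1} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_1}{\partial x_n} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{bmatrix}
\equiv \frac{\partial \mathbf{y}^\top}{\partial \mathbf{x}}.
\]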
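The two layouts differ only in how the same partial derivatives are arranged, which a small numerical sketch can show. The helper `jacobian_numerator` and the example map `f` are hypothetical names chosen for this illustration; the derivative is approximated by central differences.

```python
import numpy as np

def jacobian_numerator(f, x, h=1e-6):
    """Central-difference Jacobian in numerator layout: entry (i, j) = dy_i/dx_j."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))
    return np.stack(cols, axis=1)        # shape (m, n)

# Hypothetical map from R^3 to R^2, chosen only for illustration.
f = lambda x: np.array([x[0] * x[1], np.sin(x[2])])
x0 = np.array([1.0, 2.0, 0.5])

J_num = jacobian_numerator(f, x0)        # numerator layout, 2 x 3
J_den = J_num.T                          # denominator layout, 3 x 2
```

With m = 2 outputs and n = 3 inputs, the numerator-layout result is m×n and the denominator-layout result is its n×m transpose.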

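Derivative by Matrix

Numerator layout notation:
\[
\frac{\partial y}{\partial \mathbf{X}} =
\begin{bmatrix}
\dfrac{\partial y}{\partial x_{11}} & \cdots & \dfrac{\partial y}{\partial x_{m1}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y}{\partial x_{1n}} & \cdots & \dfrac{\partial y}{\partial x_{mn}}
\end{bmatrix}
\equiv \frac{\partial y}{\partial \mathbf{X}^\top}.
\]

Denominator layout notation:
\[
\frac{\partial y}{\partial \mathbf{X}} =
\begin{bmatrix}
\dfrac{\partial y}{\partial x_{11}} & \cdots & \dfrac{\partial y}{\partial x_{1n}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y}{\partial x_{m1}} & \cdots & \dfrac{\partial y}{\partial x_{mn}}
\end{bmatrix}
\equiv \frac{\partial y}{\partial \mathbf{X}}.
\]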
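A quick numerical sketch of the scalar-by-matrix case, under the assumption that the scalar is $y = \mathbf{a}^\top \mathbf{X} \mathbf{b}$ (an example function chosen here, not one from the slides). Its partial derivatives are $\partial y/\partial x_{ij} = a_i b_j$, so the denominator-layout result is $\mathbf{a}\mathbf{b}^\top$ and the numerator-layout result is its transpose.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 2, 3
a = rng.normal(size=m)
b = rng.normal(size=n)
X = rng.normal(size=(m, n))

y = lambda X: a @ X @ b                  # hypothetical scalar function of X

# Finite differences, entry by entry: G[i, j] approximates dy/dx_ij.
h = 1e-6
G = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n))
        E[i, j] = h
        G[i, j] = (y(X + E) - y(X - E)) / (2 * h)

dy_dX_den = G                            # denominator layout: same shape as X
dy_dX_num = G.T                          # numerator layout: shape of X^T
```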

Pictorial Representation
[Figure: pictorial comparison of numerator layout and denominator layout.]

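Caution
◮ Most books and papers don’t state which convention they use.
◮ Reference [2] uses both conventions but clearly differentiates them:
\[
\frac{\partial y}{\partial \mathbf{x}^\top} =
\begin{bmatrix} \dfrac{\partial y}{\partial x_1} & \cdots & \dfrac{\partial y}{\partial x_n} \end{bmatrix}, \qquad
\frac{\partial y}{\partial \mathbf{x}} =
\begin{bmatrix} \dfrac{\partial y}{\partial x_1} \\ \vdots \\ \dfrac{\partial y}{\partial x_n} \end{bmatrix},
\]
\[
\frac{\partial \mathbf{y}}{\partial \mathbf{x}^\top} =
\begin{bmatrix}
\dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_m}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{bmatrix}, \qquad
\frac{\partial \mathbf{y}^\top}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_1} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_1}{\partial x_n} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{bmatrix}.
\]
◮ It is best not to mix the two conventions in your equations.
◮ We adopt numerator layout notation.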
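
Commonly Used Derivatives

Here, scalar $a$, vector $\mathbf{a}$ and matrix $\mathbf{A}$ are not functions of $x$ and $\mathbf{x}$.
(C1)   $\frac{d\mathbf{a}}{dx} = \mathbf{0}$   (column matrix)
(C2)   $\frac{da}{d\mathbf{x}} = \mathbf{0}^\top$   (row matrix)
(C3)   $\frac{da}{d\mathbf{X}} = \mathbf{0}^\top$   (matrix)
(C4)   $\frac{d\mathbf{a}}{d\mathbf{x}} = \mathbf{0}$   (matrix)
(C5)   $\frac{d\mathbf{x}}{d\mathbf{x}} = \mathbf{I}$
(C6)   $\frac{d\,\mathbf{a}^\top \mathbf{x}}{d\mathbf{x}} = \frac{d\,\mathbf{x}^\top \mathbf{a}}{d\mathbf{x}} = \mathbf{a}^\top$
(C7)   $\frac{d\,\mathbf{x}^\top \mathbf{x}}{d\mathbf{x}} = 2\,\mathbf{x}^\top$
(C8)   $\frac{d\,(\mathbf{x}^\top \mathbf{a})^2}{d\mathbf{x}} = 2\,\mathbf{x}^\top \mathbf{a}\,\mathbf{a}^\top$
(C9)   $\frac{d\,\mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{A}$
(C10)  $\frac{d\,\mathbf{x}^\top \mathbf{A}}{d\mathbf{x}} = \mathbf{A}^\top$
(C11)  $\frac{d\,\mathbf{x}^\top \mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top)$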
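One simple way to double-check rules such as (C6), (C7), (C8) and (C11) is to compare them with finite differences. The sketch below does that for random inputs; `row_derivative` is a hypothetical helper written for this check, and the 1-D NumPy arrays it returns stand in for the row matrices of numerator layout.

```python
import numpy as np

def row_derivative(f, x, h=1e-6):
    """Central-difference derivative of a scalar f by the vector x.

    Returned as a 1-D array, read here as a row (numerator layout).
    """
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(3)
n = 4
x = rng.normal(size=n)
a = rng.normal(size=n)
A = rng.normal(size=(n, n))              # deliberately not symmetric

c6 = row_derivative(lambda x: a @ x, x)            # compare with a^T
c7 = row_derivative(lambda x: x @ x, x)            # compare with 2 x^T
c8 = row_derivative(lambda x: (x @ a) ** 2, x)     # compare with 2 x^T a a^T
c11 = row_derivative(lambda x: x @ A @ x, x)       # compare with x^T (A + A^T)
```

Using a non-symmetric `A` matters for (C11): with a symmetric matrix the check could not tell $\mathbf{x}^\top(\mathbf{A} + \mathbf{A}^\top)$ apart from $2\,\mathbf{x}^\top\mathbf{A}$.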

Math Notation
We represent a vector $\mathbf{x}$ as a column matrix
\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}. \]
Its transpose $\mathbf{x}^\top$ is a row matrix
\[ \mathbf{x}^\top = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix}. \]

Consider two vectors $\mathbf{x}$ and $\mathbf{y}$ with the same number of components.
Their inner product $\mathbf{x}^\top \mathbf{y}$ is actually a 1×1 matrix:
\[ \mathbf{x}^\top \mathbf{y} = [\, s \,], \quad \text{where} \quad s = \sum_{i=1}^{m} x_i y_i. \]
For notational convenience, we usually drop the matrix brackets and
regard the inner product as a scalar, i.e.,
\[ \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{m} x_i y_i. \]
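The same distinction shows up in numerical code. In the sketch below (example values chosen arbitrarily), multiplying a 1×m row matrix by an m×1 column matrix gives a 1×1 array, which then has to be unwrapped to obtain the scalar.

```python
import numpy as np

# Column matrices (shape m x 1), matching the convention in the text.
x = np.array([[1.0], [2.0], [3.0]])
y = np.array([[4.0], [5.0], [6.0]])

s_matrix = x.T @ y            # 1 x 1 matrix [[32.]]
s_scalar = s_matrix.item()    # drop the matrix wrapper: 32.0

# With 1-D arrays NumPy returns the scalar directly.
s_direct = np.dot(x.ravel(), y.ravel())
```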

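Derivatives of Scalar by Scalar
(SS1)  $\frac{\partial (u + v)}{\partial x} = \frac{\partial u}{\partial x} + \frac{\partial v}{\partial x}$
(SS2)  $\frac{\partial\, uv}{\partial x} = u \frac{\partial v}{\partial x} + v \frac{\partial u}{\partial x}$   (product rule)
(SS3)  $\frac{\partial g(u)}{\partial x} = \frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial x}$   (chain rule)
(SS4)  $\frac{\partial f(g(u))}{\partial x} = \frac{\partial f(g)}{\partial g} \frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial x}$   (chain rule)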

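Derivatives of Vector by Scalar

(VS1)  $\frac{\partial\, a\mathbf{u}}{\partial x} = a \frac{\partial \mathbf{u}}{\partial x}$
       where $a$ is not a function of $x$.
(VS2)  $\frac{\partial\, \mathbf{A}\mathbf{u}}{\partial x} = \mathbf{A} \frac{\partial \mathbf{u}}{\partial x}$
       where $\mathbf{A}$ is not a function of $x$.
(VS3)  $\frac{\partial \mathbf{u}^\top}{\partial x} = \left( \frac{\partial \mathbf{u}}{\partial x} \right)^{\!\top}$
(VS4)  $\frac{\partial (\mathbf{u} + \mathbf{v})}{\partial x} = \frac{\partial \mathbf{u}}{\partial x} + \frac{\partial \mathbf{v}}{\partial x}$
(VS5)  $\frac{\partial \mathbf{g}(\mathbf{u})}{\partial x} = \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial x}$   (chain rule)
       with consistent matrix layout.
(VS6)  $\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial x} = \frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial x}$   (chain rule)
       with consistent matrix layout.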

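Derivatives of Matrix by Scalar

(MS1)  $\frac{\partial\, a\mathbf{U}}{\partial x} = a \frac{\partial \mathbf{U}}{\partial x}$
       where $a$ is not a function of $x$.
(MS2)  $\frac{\partial\, \mathbf{A}\mathbf{U}\mathbf{B}}{\partial x} = \mathbf{A} \frac{\partial \mathbf{U}}{\partial x} \mathbf{B}$
       where $\mathbf{A}$ and $\mathbf{B}$ are not functions of $x$.
(MS3)  $\frac{\partial (\mathbf{U} + \mathbf{V})}{\partial x} = \frac{\partial \mathbf{U}}{\partial x} + \frac{\partial \mathbf{V}}{\partial x}$
(MS4)  $\frac{\partial\, \mathbf{U}\mathbf{V}}{\partial x} = \mathbf{U} \frac{\partial \mathbf{V}}{\partial x} + \frac{\partial \mathbf{U}}{\partial x} \mathbf{V}$   (product rule)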
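The order of the factors in (MS4) matters because matrix multiplication does not commute. The following sketch checks the rule numerically for two made-up matrix-valued functions `U` and `V` of a scalar (with hand-written derivatives `dU` and `dV`), and shows that swapping the factors gives a different matrix.

```python
import numpy as np

# Hypothetical matrix-valued functions of a scalar x, with known derivatives.
U  = lambda x: np.array([[x, x**2], [1.0, np.sin(x)]])
dU = lambda x: np.array([[1.0, 2 * x], [0.0, np.cos(x)]])
V  = lambda x: np.array([[np.exp(x), 2.0], [x, x**3]])
dV = lambda x: np.array([[np.exp(x), 0.0], [1.0, 3 * x**2]])

x0, h = 0.7, 1e-6

# Left-hand side of (MS4) by central differences.
lhs = (U(x0 + h) @ V(x0 + h) - U(x0 - h) @ V(x0 - h)) / (2 * h)

# Right-hand side of (MS4); the factor order matters.
rhs = U(x0) @ dV(x0) + dU(x0) @ V(x0)

# Swapping the factors gives a different matrix for this example.
wrong = dV(x0) @ U(x0) + V(x0) @ dU(x0)
```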
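
Derivatives of Scalar by Vector

(SV1)  $\frac{\partial\, au}{\partial \mathbf{x}} = a \frac{\partial u}{\partial \mathbf{x}}$
       where $a$ is not a function of $\mathbf{x}$.
(SV2)  $\frac{\partial (u + v)}{\partial \mathbf{x}} = \frac{\partial u}{\partial \mathbf{x}} + \frac{\partial v}{\partial \mathbf{x}}$
(SV3)  $\frac{\partial\, uv}{\partial \mathbf{x}} = u \frac{\partial v}{\partial \mathbf{x}} + v \frac{\partial u}{\partial \mathbf{x}}$   (product rule)
(SV4)  $\frac{\partial g(u)}{\partial \mathbf{x}} = \frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \mathbf{x}}$   (chain rule)
(SV5)  $\frac{\partial f(g(u))}{\partial \mathbf{x}} = \frac{\partial f(g)}{\partial g} \frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \mathbf{x}}$   (chain rule)
(SV6)  $\frac{\partial\, \mathbf{u}^\top \mathbf{v}}{\partial \mathbf{x}} = \mathbf{u}^\top \frac{\partial \mathbf{v}}{\partial \mathbf{x}} + \mathbf{v}^\top \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$   (product rule)
       where $\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ and $\frac{\partial \mathbf{v}}{\partial \mathbf{x}}$ are in numerator layout.
(SV7)  $\frac{\partial\, \mathbf{u}^\top \mathbf{A} \mathbf{v}}{\partial \mathbf{x}} = \mathbf{u}^\top \mathbf{A} \frac{\partial \mathbf{v}}{\partial \mathbf{x}} + \mathbf{v}^\top \mathbf{A}^\top \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$   (product rule)
       where $\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ and $\frac{\partial \mathbf{v}}{\partial \mathbf{x}}$ are in numerator layout,
       and $\mathbf{A}$ is not a function of $\mathbf{x}$.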
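Rule (SV7) can be sanity-checked numerically. The sketch below assumes, purely for illustration, that $\mathbf{u}(\mathbf{x}) = \mathbf{B}\mathbf{x}$ and $\mathbf{v}(\mathbf{x}) = \mathbf{C}\mathbf{x} + \mathbf{c}$, so that their numerator-layout derivatives are simply $\mathbf{B}$ and $\mathbf{C}$; none of these names come from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 3, 4, 5
A = rng.normal(size=(p, q))              # constant matrix
B = rng.normal(size=(p, n))              # du/dx in numerator layout
C = rng.normal(size=(q, n))              # dv/dx in numerator layout
c = rng.normal(size=q)

u = lambda x: B @ x                      # hypothetical u(x)
v = lambda x: C @ x + c                  # hypothetical v(x)
s = lambda x: u(x) @ A @ v(x)            # the scalar u^T A v

x0, h = rng.normal(size=n), 1e-6

# Left-hand side: derivative of the scalar by x, one component at a time.
lhs = np.array([(s(x0 + h * e) - s(x0 - h * e)) / (2 * h) for e in np.eye(n)])

# Right-hand side of (SV7): u^T A dv/dx + v^T A^T du/dx.
rhs = u(x0) @ A @ C + v(x0) @ A.T @ B
```

Choosing different sizes for $\mathbf{u}$ and $\mathbf{v}$ (p and q) also confirms that the shapes in (SV7) line up: both terms are rows with n entries.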

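Derivatives of Scalar by Matrix

(SM1)  $\frac{\partial\, au}{\partial \mathbf{X}} = a \frac{\partial u}{\partial \mathbf{X}}$
       where $a$ is not a function of $\mathbf{X}$.
(SM2)  $\frac{\partial (u + v)}{\partial \mathbf{X}} = \frac{\partial u}{\partial \mathbf{X}} + \frac{\partial v}{\partial \mathbf{X}}$
(SM3)  $\frac{\partial\, uv}{\partial \mathbf{X}} = u \frac{\partial v}{\partial \mathbf{X}} + v \frac{\partial u}{\partial \mathbf{X}}$   (product rule)
(SM4)  $\frac{\partial g(u)}{\partial \mathbf{X}} = \frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \mathbf{X}}$   (chain rule)
(SM5)  $\frac{\partial f(g(u))}{\partial \mathbf{X}} = \frac{\partial f(g)}{\partial g} \frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \mathbf{X}}$   (chain rule)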
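
Derivatives of Vector by Vector

(VV1)  $\frac{\partial\, a\mathbf{u}}{\partial \mathbf{x}} = a \frac{\partial \mathbf{u}}{\partial \mathbf{x}} + \mathbf{u} \frac{\partial a}{\partial \mathbf{x}}$   (product rule)
(VV2)  $\frac{\partial\, \mathbf{A}\mathbf{u}}{\partial \mathbf{x}} = \mathbf{A} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$
       where $\mathbf{A}$ is not a function of $\mathbf{x}$.
(VV3)  $\frac{\partial (\mathbf{u} + \mathbf{v})}{\partial \mathbf{x}} = \frac{\partial \mathbf{u}}{\partial \mathbf{x}} + \frac{\partial \mathbf{v}}{\partial \mathbf{x}}$
(VV4)  $\frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{x}} = \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$   (chain rule)
(VV5)  $\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial \mathbf{x}} = \frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$   (chain rule)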
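The chain rule (VV4) says that, in numerator layout, the Jacobian of a composition is the product of the Jacobians taken from left to right. A minimal numerical sketch, assuming a linear inner map and an elementwise sine as the outer map (both chosen only for this example; `jacobian` is a hypothetical helper):

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Central-difference Jacobian in numerator layout (rows: outputs, columns: inputs)."""
    return np.stack(
        [(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(x.size)],
        axis=1,
    )

rng = np.random.default_rng(5)
B = rng.normal(size=(3, 2))
u = lambda x: B @ x                      # hypothetical inner map, R^2 -> R^3
g = lambda u: np.sin(u)                  # hypothetical outer map, elementwise

x0 = rng.normal(size=2)

lhs = jacobian(lambda x: g(u(x)), x0)            # d g(u(x)) / dx, shape 3 x 2
rhs = jacobian(g, u(x0)) @ jacobian(u, x0)       # (dg/du)(du/dx), left to right
```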
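
Notes on Denominator Layout

In some cases, the results of denominator layout are the transpose of
those of numerator layout. Moreover, the chain rule for denominator
layout goes from right to left instead of left to right.

(C6)   Numerator layout:    $\frac{d\,\mathbf{a}^\top \mathbf{x}}{d\mathbf{x}} = \mathbf{a}^\top$
       Denominator layout:  $\frac{d\,\mathbf{a}^\top \mathbf{x}}{d\mathbf{x}} = \mathbf{a}$
(C11)  Numerator layout:    $\frac{d\,\mathbf{x}^\top \mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top)$
       Denominator layout:  $\frac{d\,\mathbf{x}^\top \mathbf{A}\mathbf{x}}{d\mathbf{x}} = (\mathbf{A} + \mathbf{A}^\top)\,\mathbf{x}$
(VV5)  Numerator layout:    $\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial \mathbf{x}} = \frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$
       Denominator layout:  $\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial \mathbf{x}} = \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}}$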
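The reversed order in denominator layout is just the transpose rule $(\mathbf{F}\mathbf{G}\mathbf{H})^\top = \mathbf{H}^\top \mathbf{G}^\top \mathbf{F}^\top$ applied to the numerator-layout chain. The sketch below illustrates this with random stand-in Jacobians; the dimensions and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical Jacobians in numerator layout for a chain x -> u -> g -> f,
# with x in R^2, u in R^3, g in R^4 and f in R^5.
df_dg = rng.normal(size=(5, 4))
dg_du = rng.normal(size=(4, 3))
du_dx = rng.normal(size=(3, 2))

# Numerator layout: multiply left to right, result is 5 x 2.
num = df_dg @ dg_du @ du_dx

# Denominator layout: every factor is transposed and the order is reversed,
# result is 2 x 5.
den = du_dx.T @ dg_du.T @ df_dg.T
```

With these non-square shapes, multiplying the transposed factors in the numerator-layout order would not even be shape-compatible, which is a quick way to spot a mixed-up convention.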

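Derivations of Derivatives

(C6)  $\frac{d\,\mathbf{a}^\top \mathbf{x}}{d\mathbf{x}} = \frac{d\,\mathbf{x}^\top \mathbf{a}}{d\mathbf{x}} = \mathbf{a}^\top$
(The not-so-hard way)
Let $s = \mathbf{a}^\top \mathbf{x} = a_1 x_1 + \cdots + a_n x_n$. Then, $\frac{\partial s}{\partial x_i} = a_i$. So, $\frac{ds}{d\mathbf{x}} = \mathbf{a}^\top$.
(The easier way)
Let $s = \mathbf{a}^\top \mathbf{x} = \sum_i a_i x_i$. Then, $\frac{\partial s}{\partial x_i} = a_i$. So, $\frac{ds}{d\mathbf{x}} = \mathbf{a}^\top$.

(C7)  $\frac{d\,\mathbf{x}^\top \mathbf{x}}{d\mathbf{x}} = 2\,\mathbf{x}^\top$
Let $s = \mathbf{x}^\top \mathbf{x} = \sum_i x_i^2$. Then, $\frac{\partial s}{\partial x_i} = 2 x_i$. So, $\frac{ds}{d\mathbf{x}} = 2\,\mathbf{x}^\top$.

(C8)  $\frac{d\,(\mathbf{x}^\top \mathbf{a})^2}{d\mathbf{x}} = 2\,\mathbf{x}^\top \mathbf{a}\,\mathbf{a}^\top$
Let $s = \mathbf{x}^\top \mathbf{a}$. Then, $\frac{\partial s^2}{\partial x_i} = 2 s \frac{\partial s}{\partial x_i} = 2 s a_i$. So, $\frac{ds^2}{d\mathbf{x}} = 2\,\mathbf{x}^\top \mathbf{a}\,\mathbf{a}^\top$.

(C9)  $\frac{d\,\mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{A}$
(The hard way)
\[
\mathbf{A}\mathbf{x} =
\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} a_{11} x_1 + \cdots + a_{1n} x_n \\ \vdots \\ a_{n1} x_1 + \cdots + a_{nn} x_n \end{bmatrix}
\]
(The easy way)
Let $\mathbf{s} = \mathbf{A}\mathbf{x}$. Then, $s_i = \sum_j a_{ij} x_j$, and $\frac{\partial s_i}{\partial x_j} = a_{ij}$. So, $\frac{d\mathbf{s}}{d\mathbf{x}} = \mathbf{A}$.

(C10)  $\frac{d\,\mathbf{x}^\top \mathbf{A}}{d\mathbf{x}} = \mathbf{A}^\top$
Let $\mathbf{y}^\top = \mathbf{x}^\top \mathbf{A}$, and let $\mathbf{a}_j$ denote the j-th column of $\mathbf{A}$. Then, $y_j = \mathbf{x}^\top \mathbf{a}_j$.
Applying (C6) yields $\frac{dy_j}{d\mathbf{x}} = \mathbf{a}_j^\top$. So, $\frac{d\mathbf{y}^\top}{d\mathbf{x}} = \mathbf{A}^\top$.

(C11)  $\frac{d\,\mathbf{x}^\top \mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top)$
Apply (SV6) to $\frac{d\,\mathbf{x}^\top \mathbf{A}\mathbf{x}}{d\mathbf{x}}$ with $\mathbf{u} = \mathbf{x}$ and $\mathbf{v} = \mathbf{A}\mathbf{x}$, and obtain
$\mathbf{x}^\top \frac{d\,\mathbf{A}\mathbf{x}}{d\mathbf{x}} + (\mathbf{A}\mathbf{x})^\top \frac{d\mathbf{x}}{d\mathbf{x}}$.
Next, apply (C9) to the first part of the sum and (C5) to the second part, and obtain
$\mathbf{x}^\top \mathbf{A} + (\mathbf{A}\mathbf{x})^\top$, which is $\mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top)$.
(The proof of SV6 is left as homework.)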
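As a concrete check of (C11), the 2×2 case can be expanded by hand. The entries $a_{11}, a_{12}, a_{21}, a_{22}$ below are those of a generic 2×2 matrix $\mathbf{A}$, introduced only for this worked example.

```latex
% Worked 2x2 instance of (C11), numerator layout.
\[
\mathbf{x}^\top \mathbf{A} \mathbf{x}
= a_{11} x_1^2 + (a_{12} + a_{21})\, x_1 x_2 + a_{22} x_2^2 ,
\]
\[
\frac{d\, \mathbf{x}^\top \mathbf{A} \mathbf{x}}{d\mathbf{x}}
= \begin{bmatrix}
2 a_{11} x_1 + (a_{12} + a_{21}) x_2 &
(a_{12} + a_{21}) x_1 + 2 a_{22} x_2
\end{bmatrix}
= \begin{bmatrix} x_1 & x_2 \end{bmatrix}
\begin{bmatrix} 2 a_{11} & a_{12} + a_{21} \\ a_{12} + a_{21} & 2 a_{22} \end{bmatrix}
= \mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top).
\]
```

The off-diagonal entries of $\mathbf{A}$ only ever appear as the sum $a_{12} + a_{21}$, which is why the result depends on $\mathbf{A} + \mathbf{A}^\top$ rather than on $\mathbf{A}$ alone.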


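Linear Fitting Revisited

Now, let us show that the solution
\[ \mathbf{a} = (\mathbf{D}^\top \mathbf{D})^{-1} \mathbf{D}^\top \mathbf{v} \]
minimizes the error
\[ E = \sum_{i=1}^{n} (\mathbf{d}_i^\top \mathbf{a} - v_i)^2 = \|\mathbf{D}\mathbf{a} - \mathbf{v}\|^2. \]
Proof:
\begin{align*}
E &= \|\mathbf{D}\mathbf{a} - \mathbf{v}\|^2 = (\mathbf{D}\mathbf{a} - \mathbf{v})^\top (\mathbf{D}\mathbf{a} - \mathbf{v}) \\
  &= (\mathbf{a}^\top \mathbf{D}^\top - \mathbf{v}^\top)(\mathbf{D}\mathbf{a} - \mathbf{v}) \\
  &= \mathbf{a}^\top \mathbf{D}^\top \mathbf{D}\mathbf{a} - \mathbf{a}^\top \mathbf{D}^\top \mathbf{v} - \mathbf{v}^\top \mathbf{D}\mathbf{a} + \mathbf{v}^\top \mathbf{v}.
\end{align*}
Apply (C11), (C6), (C9) and (C2) to the four terms, in that order:
\begin{align*}
\frac{dE}{d\mathbf{a}} &= \mathbf{a}^\top (\mathbf{D}^\top \mathbf{D} + \mathbf{D}^\top \mathbf{D}) - (\mathbf{D}^\top \mathbf{v})^\top - \mathbf{v}^\top \mathbf{D} + \mathbf{0}^\top \\
  &= 2\,\mathbf{a}^\top \mathbf{D}^\top \mathbf{D} - 2\,\mathbf{v}^\top \mathbf{D}.
\end{align*}
Set $dE/d\mathbf{a} = \mathbf{0}^\top$ and obtain
\begin{align*}
2\,\mathbf{a}^\top \mathbf{D}^\top \mathbf{D} - 2\,\mathbf{v}^\top \mathbf{D} &= \mathbf{0}^\top \\
\mathbf{a}^\top \mathbf{D}^\top \mathbf{D} &= \mathbf{v}^\top \mathbf{D}.
\end{align*}
Transpose both sides of the equation and get
\begin{align*}
\mathbf{D}^\top \mathbf{D}\,\mathbf{a} &= \mathbf{D}^\top \mathbf{v} \\
\mathbf{a} &= (\mathbf{D}^\top \mathbf{D})^{-1} \mathbf{D}^\top \mathbf{v}.
\end{align*}
Since $\mathbf{D}^\top \mathbf{D}$ is positive semidefinite, E is a convex quadratic function of $\mathbf{a}$, so this stationary point is a minimum. The inverse exists when $\mathbf{D}$ has full column rank.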
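The result of the proof can be checked numerically. In the sketch below, `D` and `v` are random stand-ins for a data matrix and its values (assumptions for illustration); the code evaluates the derivative $2\,\mathbf{a}^\top \mathbf{D}^\top \mathbf{D} - 2\,\mathbf{v}^\top \mathbf{D}$ at the solution and compares the error at the solution with the error at nearby points.

```python
import numpy as np

rng = np.random.default_rng(7)
D = np.hstack([rng.normal(size=(20, 3)), np.ones((20, 1))])   # hypothetical data matrix
v = rng.normal(size=20)                                       # hypothetical noisy values

E = lambda a: np.sum((D @ a - v) ** 2)            # E = ||D a - v||^2
dE_da = lambda a: 2 * a @ D.T @ D - 2 * v @ D     # row derivative from the proof

a_star = np.linalg.solve(D.T @ D, D.T @ v)        # a = (D^T D)^{-1} D^T v

# The derivative vanishes at a_star ...
grad_at_solution = dE_da(a_star)

# ... and random perturbations of a_star increase the error.
increases = [E(a_star + 0.1 * rng.normal(size=4)) - E(a_star) for _ in range(100)]
```

Here `grad_at_solution` should be zero up to rounding, and every entry of `increases` should be positive, consistent with the solution being a minimum rather than just a stationary point.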

Summary
◮ Matrix calculus studies calculus of matrices.
◮ There are 6 common derivatives of matrices.
◮ There are 2 competing notational conventions:
numerator layout notation vs. denominator layout notation.
◮ We adopt numerator layout notation.
◮ Do not mix the two conventions in your equations.
◮ Use matrix differentiation to prove that the pseudo-inverse minimizes
the sum-of-squares error.

Use the right tool and become lightning fast!

Probing Questions
◮ Is there a simple way to double check that the derivative result
makes sense?
◮ Why do we use sum square error for linear fitting? Can we use
other forms of errors?
◮ Six common types of matrix derivatives are discussed. Three other
types are left out. Can we work out the other derivatives, e.g.,
derivatives of vector by matrix or matrix by matrix?

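Homework
1. What are the key concepts that you have learned?
2. Prove the product rule SV3 using scalar product rule SS2.
(SV3)  $\frac{\partial\, uv}{\partial \mathbf{x}} = u \frac{\partial v}{\partial \mathbf{x}} + v \frac{\partial u}{\partial \mathbf{x}}$
3. Prove the product rule SV6 using SV3.
(SV6)  $\frac{\partial\, \mathbf{u}^\top \mathbf{v}}{\partial \mathbf{x}} = \mathbf{u}^\top \frac{\partial \mathbf{v}}{\partial \mathbf{x}} + \mathbf{v}^\top \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$
where $\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ and $\frac{\partial \mathbf{v}}{\partial \mathbf{x}}$ are in numerator layout.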

References
1. J. E. Gentle, Matrix Algebra: Theory, Computations, and Applications in Statistics, Springer, 2007.
2. H. Lütkepohl, Handbook of Matrices, John Wiley & Sons, 1996.
3. K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, 2012. www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
4. Wikipedia, Matrix Calculus. en.wikipedia.org/wiki/Matrix_calculus
