Appendix D
Matrix Calculus
From too much study, and from extreme passion, cometh madnesse.
− Isaac Newton [205, §5]
D.1 Gradient, Directional derivative, Taylor series

D.1.1 Gradients
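Gradient of a differentiable real function $f(x) : \mathbb{R}^K \to \mathbb{R}$ with respect to its vector
argument is defined uniquely in terms of partial derivatives
\[
\nabla f(x) \triangleq \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} \\[1ex] \dfrac{\partial f(x)}{\partial x_2} \\ \vdots \\ \dfrac{\partial f(x)}{\partial x_K} \end{bmatrix} \in \mathbb{R}^K \tag{2053}
\]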
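while the second-order gradient of the twice differentiable real function with respect to its
vector argument is traditionally called the Hessian;
\[
\nabla^2 f(x) \triangleq \begin{bmatrix}
\dfrac{\partial^2 f(x)}{\partial x_1^2} & \dfrac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_1 \partial x_K} \\[1ex]
\dfrac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_2 \partial x_K} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f(x)}{\partial x_K \partial x_1} & \dfrac{\partial^2 f(x)}{\partial x_K \partial x_2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_K^2}
\end{bmatrix} \in \mathbb{S}^K \tag{2054}
\]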
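interpreted
\[
\frac{\partial^2 f(x)}{\partial x_1 \partial x_2} = \frac{\partial \left( \frac{\partial f(x)}{\partial x_1} \right)}{\partial x_2} = \frac{\partial \left( \frac{\partial f(x)}{\partial x_2} \right)}{\partial x_1} = \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} \tag{2055}
\]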
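Definitions (2053) and (2054) can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `gradient` and `hessian`, the step sizes, and the test function are all assumptions chosen for illustration. It estimates the gradient and Hessian by central differences and lets one observe the symmetry stated in (2055).

```python
def gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient (2053) of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g


def hessian(f, x, h=1e-4):
    """Central-difference estimate of the Hessian (2054) of f at x."""
    K = len(x)
    H = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            def shifted(si, sj):
                y = list(x)
                y[i] += si * h
                y[j] += sj * h
                return f(y)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H


def f(x):
    """Hypothetical test function f(x) = x1^2 x2 + 3 x2^2 (not from the text)."""
    return x[0] ** 2 * x[1] + 3 * x[1] ** 2


x0 = [1.0, 2.0]
g = gradient(f, x0)  # analytic gradient: [2 x1 x2, x1^2 + 6 x2]
H = hessian(f, x0)   # analytic Hessian: [[2 x2, 2 x1], [2 x1, 6]]
print(g, H)
```

The off-diagonal estimates `H[0][1]` and `H[1][0]` agree up to rounding, which is the numerical counterpart of (2055).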
Dattorro, Convex Optimization Euclidean Distance Geometry, Mεβoo, 2005, v2020.02.29. 599
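The gradient of vector-valued function $v(x) : \mathbb{R} \to \mathbb{R}^N$ on real domain is a row vector
\[
\nabla v(x) \triangleq \begin{bmatrix} \dfrac{\partial v_1(x)}{\partial x} & \dfrac{\partial v_2(x)}{\partial x} & \cdots & \dfrac{\partial v_N(x)}{\partial x} \end{bmatrix} \in \mathbb{R}^N \tag{2056}
\]
while the second-order gradient is
\[
\nabla^2 v(x) \triangleq \begin{bmatrix} \dfrac{\partial^2 v_1(x)}{\partial x^2} & \dfrac{\partial^2 v_2(x)}{\partial x^2} & \cdots & \dfrac{\partial^2 v_N(x)}{\partial x^2} \end{bmatrix} \in \mathbb{R}^N \tag{2057}
\]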
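Gradient of vector-valued function $h(x) : \mathbb{R}^K \to \mathbb{R}^N$ on vector domain is
\[
\begin{aligned}
\nabla h(x) &\triangleq \begin{bmatrix}
\dfrac{\partial h_1(x)}{\partial x_1} & \dfrac{\partial h_2(x)}{\partial x_1} & \cdots & \dfrac{\partial h_N(x)}{\partial x_1} \\[1ex]
\dfrac{\partial h_1(x)}{\partial x_2} & \dfrac{\partial h_2(x)}{\partial x_2} & \cdots & \dfrac{\partial h_N(x)}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial h_1(x)}{\partial x_K} & \dfrac{\partial h_2(x)}{\partial x_K} & \cdots & \dfrac{\partial h_N(x)}{\partial x_K}
\end{bmatrix} \\
&= \begin{bmatrix} \nabla h_1(x) & \nabla h_2(x) & \cdots & \nabla h_N(x) \end{bmatrix} \in \mathbb{R}^{K \times N}
\end{aligned} \tag{2058}
\]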
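while the second-order gradient has a three-dimensional written representation dubbed
cubix;D.1
\[
\begin{aligned}
\nabla^2 h(x) &\triangleq \begin{bmatrix}
\nabla \dfrac{\partial h_1(x)}{\partial x_1} & \nabla \dfrac{\partial h_2(x)}{\partial x_1} & \cdots & \nabla \dfrac{\partial h_N(x)}{\partial x_1} \\[1ex]
\nabla \dfrac{\partial h_1(x)}{\partial x_2} & \nabla \dfrac{\partial h_2(x)}{\partial x_2} & \cdots & \nabla \dfrac{\partial h_N(x)}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\nabla \dfrac{\partial h_1(x)}{\partial x_K} & \nabla \dfrac{\partial h_2(x)}{\partial x_K} & \cdots & \nabla \dfrac{\partial h_N(x)}{\partial x_K}
\end{bmatrix} \\
&= \begin{bmatrix} \nabla^2 h_1(x) & \nabla^2 h_2(x) & \cdots & \nabla^2 h_N(x) \end{bmatrix} \in \mathbb{R}^{K \times N \times K}
\end{aligned} \tag{2059}
\]
where the gradient of each real entry is with respect to vector $x$ as in (2053).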
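The gradient of real function $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}$ on matrix domain is
\[
\begin{aligned}
\nabla g(X) &\triangleq \begin{bmatrix}
\dfrac{\partial g(X)}{\partial X_{11}} & \dfrac{\partial g(X)}{\partial X_{12}} & \cdots & \dfrac{\partial g(X)}{\partial X_{1L}} \\[1ex]
\dfrac{\partial g(X)}{\partial X_{21}} & \dfrac{\partial g(X)}{\partial X_{22}} & \cdots & \dfrac{\partial g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial g(X)}{\partial X_{K1}} & \dfrac{\partial g(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K \times L} \\
&= \begin{bmatrix} \nabla_{X(:,1)}\, g(X) & \nabla_{X(:,2)}\, g(X) & \cdots & \nabla_{X(:,L)}\, g(X) \end{bmatrix} \in \mathbb{R}^{K \times 1 \times L}
\end{aligned} \tag{2060}
\]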
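where gradient $\nabla_{X(:,i)}$ is with respect to the $i$th column of $X$. The strange appearance of
(2060) in $\mathbb{R}^{K \times 1 \times L}$ is meant to suggest a third dimension perpendicular to the page (not
a diagonal matrix).

D.1 The word matrix comes from the Latin for womb; related to the prefix matri- derived from mater
meaning mother.

The second-order gradient has representation
\[
\begin{aligned}
\nabla^2 g(X) &\triangleq \begin{bmatrix}
\nabla \dfrac{\partial g(X)}{\partial X_{11}} & \nabla \dfrac{\partial g(X)}{\partial X_{12}} & \cdots & \nabla \dfrac{\partial g(X)}{\partial X_{1L}} \\[1ex]
\nabla \dfrac{\partial g(X)}{\partial X_{21}} & \nabla \dfrac{\partial g(X)}{\partial X_{22}} & \cdots & \nabla \dfrac{\partial g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\nabla \dfrac{\partial g(X)}{\partial X_{K1}} & \nabla \dfrac{\partial g(X)}{\partial X_{K2}} & \cdots & \nabla \dfrac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K \times L \times K \times L} \\
&= \begin{bmatrix} \nabla \nabla_{X(:,1)}\, g(X) & \nabla \nabla_{X(:,2)}\, g(X) & \cdots & \nabla \nabla_{X(:,L)}\, g(X) \end{bmatrix} \in \mathbb{R}^{K \times 1 \times L \times K \times L}
\end{aligned} \tag{2061}
\]
where the gradient $\nabla$ is with respect to matrix $X$.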
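Gradient of vector-valued function $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}^N$ on matrix domain is a cubix
\[
\begin{aligned}
\nabla g(X) &\triangleq \begin{bmatrix}
\nabla_{X(:,1)}\, g_1(X) & \nabla_{X(:,1)}\, g_2(X) & \cdots & \nabla_{X(:,1)}\, g_N(X) \\
\nabla_{X(:,2)}\, g_1(X) & \nabla_{X(:,2)}\, g_2(X) & \cdots & \nabla_{X(:,2)}\, g_N(X) \\
\vdots & \vdots & & \vdots \\
\nabla_{X(:,L)}\, g_1(X) & \nabla_{X(:,L)}\, g_2(X) & \cdots & \nabla_{X(:,L)}\, g_N(X)
\end{bmatrix} \\
&= \begin{bmatrix} \nabla g_1(X) & \nabla g_2(X) & \cdots & \nabla g_N(X) \end{bmatrix} \in \mathbb{R}^{K \times N \times L}
\end{aligned} \tag{2062}
\]
while the second-order gradient has a five-dimensional representation;
\[
\begin{aligned}
\nabla^2 g(X) &\triangleq \begin{bmatrix}
\nabla \nabla_{X(:,1)}\, g_1(X) & \nabla \nabla_{X(:,1)}\, g_2(X) & \cdots & \nabla \nabla_{X(:,1)}\, g_N(X) \\
\nabla \nabla_{X(:,2)}\, g_1(X) & \nabla \nabla_{X(:,2)}\, g_2(X) & \cdots & \nabla \nabla_{X(:,2)}\, g_N(X) \\
\vdots & \vdots & & \vdots \\
\nabla \nabla_{X(:,L)}\, g_1(X) & \nabla \nabla_{X(:,L)}\, g_2(X) & \cdots & \nabla \nabla_{X(:,L)}\, g_N(X)
\end{bmatrix} \\
&= \begin{bmatrix} \nabla^2 g_1(X) & \nabla^2 g_2(X) & \cdots & \nabla^2 g_N(X) \end{bmatrix} \in \mathbb{R}^{K \times N \times L \times K \times L}
\end{aligned} \tag{2063}
\]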
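The gradient of matrix-valued function $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}^{M \times N}$ on matrix domain has
a four-dimensional representation called quartix (fourth-order tensor)
\[
\nabla g(X) \triangleq \begin{bmatrix}
\nabla g_{11}(X) & \nabla g_{12}(X) & \cdots & \nabla g_{1N}(X) \\
\nabla g_{21}(X) & \nabla g_{22}(X) & \cdots & \nabla g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla g_{M1}(X) & \nabla g_{M2}(X) & \cdots & \nabla g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M \times N \times K \times L} \tag{2064}
\]
while the second-order gradient has a six-dimensional representation
\[
\nabla^2 g(X) \triangleq \begin{bmatrix}
\nabla^2 g_{11}(X) & \nabla^2 g_{12}(X) & \cdots & \nabla^2 g_{1N}(X) \\
\nabla^2 g_{21}(X) & \nabla^2 g_{22}(X) & \cdots & \nabla^2 g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla^2 g_{M1}(X) & \nabla^2 g_{M2}(X) & \cdots & \nabla^2 g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M \times N \times K \times L \times K \times L} \tag{2065}
\]
and so on.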
D.1.2 Product rules for matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable $f(X)$ and $g(X)$
\[
\nabla_X \left( f(X)^T g(X) \right) = \nabla_X(f)\, g + \nabla_X(g)\, f \tag{2066}
\]
while [65, §8.3] [420]
\[
\nabla_X \operatorname{tr}\left( f(X)^T g(X) \right) = \nabla_X \left. \left( \operatorname{tr}\left( f(X)^T g(Z) \right) + \operatorname{tr}\left( g(X) f(Z)^T \right) \right) \right|_{Z \leftarrow X} \tag{2067}
\]
These expressions implicitly apply as well to scalar-, vector-, or matrix-valued functions
of scalar, vector, or matrix arguments.
D.1.2.0.1 Example. Cubix.

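Suppose $f(X) : \mathbb{R}^{2 \times 2} \to \mathbb{R}^2 = X^T a$ and $g(X) : \mathbb{R}^{2 \times 2} \to \mathbb{R}^2 = Xb$. We wish to find
\[
\nabla_X \left( f(X)^T g(X) \right) = \nabla_X\, a^T X^2 b \tag{2068}
\]
using the product rule. Formula (2066) calls for
\[
\nabla_X\, a^T X^2 b = \nabla_X (X^T a)\, Xb + \nabla_X (Xb)\, X^T a \tag{2069}
\]
Consider the first of the two terms:
\[
\begin{aligned}
\nabla_X(f)\, g &= \nabla_X (X^T a)\, Xb \\
&= \begin{bmatrix} \nabla (X^T a)_1 & \nabla (X^T a)_2 \end{bmatrix} Xb
\end{aligned} \tag{2070}
\]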
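The gradient of $X^T a$ forms a cubix in $\mathbb{R}^{2 \times 2 \times 2}$; a.k.a. third-order tensor.
\[
\nabla_X (X^T a)\, Xb = \left[ \begin{array}{cc}
\dfrac{\partial (X^T a)_1}{\partial X_{11}} & \dfrac{\partial (X^T a)_2}{\partial X_{11}} \\[1ex]
\dfrac{\partial (X^T a)_1}{\partial X_{12}} & \dfrac{\partial (X^T a)_2}{\partial X_{12}} \\[1ex] \hline
\dfrac{\partial (X^T a)_1}{\partial X_{21}} & \dfrac{\partial (X^T a)_2}{\partial X_{21}} \\[1ex]
\dfrac{\partial (X^T a)_1}{\partial X_{22}} & \dfrac{\partial (X^T a)_2}{\partial X_{22}}
\end{array} \right] \begin{bmatrix} (Xb)_1 \\ (Xb)_2 \end{bmatrix} \in \mathbb{R}^{2 \times 1 \times 2} \tag{2071}
\]
Here each row of the array collects the two entries lying along the second dimension of the cubix.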
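Because gradient of the product (2068) requires total change with respect to change in
each entry of matrix $X$, the $Xb$ vector must make an inner product with each vector in
that second dimension of the cubix (indicated by dotted line segments in the drawing of (2071));
\[
\begin{aligned}
\nabla_X (X^T a)\, Xb &= \left[ \begin{array}{cc} a_1 & 0 \\ 0 & a_1 \\ \hline a_2 & 0 \\ 0 & a_2 \end{array} \right] \begin{bmatrix} b_1 X_{11} + b_2 X_{12} \\ b_1 X_{21} + b_2 X_{22} \end{bmatrix} \in \mathbb{R}^{2 \times 1 \times 2} \\
&= \begin{bmatrix} a_1 (b_1 X_{11} + b_2 X_{12}) & a_1 (b_1 X_{21} + b_2 X_{22}) \\ a_2 (b_1 X_{11} + b_2 X_{12}) & a_2 (b_1 X_{21} + b_2 X_{22}) \end{bmatrix} \in \mathbb{R}^{2 \times 2} \\
&= a b^T X^T
\end{aligned} \tag{2072}
\]
where the cubix appears as a complete $2 \times 2 \times 2$ matrix. In like manner for the second
term $\nabla_X(g)\, f$
\[
\begin{aligned}
\nabla_X (Xb)\, X^T a &= \left[ \begin{array}{cc} b_1 & 0 \\ b_2 & 0 \\ \hline 0 & b_1 \\ 0 & b_2 \end{array} \right] \begin{bmatrix} X_{11} a_1 + X_{21} a_2 \\ X_{12} a_1 + X_{22} a_2 \end{bmatrix} \in \mathbb{R}^{2 \times 1 \times 2} \\
&= X^T a b^T \in \mathbb{R}^{2 \times 2}
\end{aligned} \tag{2073}
\]
The solution
\[
\nabla_X\, a^T X^2 b = a b^T X^T + X^T a b^T \tag{2074}
\]
can be found from Table D.2.1 or verified using (2067). □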
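Result (2074) can also be checked by brute force. The sketch below is an illustration only: the helper names, the numerical values of $a$, $b$, $X$, and the step size are assumptions, not taken from the text. It compares the closed form $ab^TX^T + X^Tab^T$ with entry-by-entry central differences of $a^TX^2b$.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def objective(X, a, b):
    """Evaluate the scalar a^T X^2 b."""
    X2 = matmul(X, X)
    return sum(a[i] * X2[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def numeric_gradient(g, X, h=1e-6):
    """Central differences with respect to each entry X_kl."""
    G = [[0.0] * len(X[0]) for _ in X]
    for k in range(len(X)):
        for l in range(len(X[0])):
            Xp = [row[:] for row in X]
            Xm = [row[:] for row in X]
            Xp[k][l] += h
            Xm[k][l] -= h
            G[k][l] = (g(Xp) - g(Xm)) / (2 * h)
    return G


# Hypothetical data, chosen only for illustration
a, b = [1.0, -2.0], [0.5, 3.0]
X = [[1.0, 2.0], [-1.0, 0.5]]

abT = [[ai * bj for bj in b] for ai in a]
XT = transpose(X)
term1 = matmul(abT, XT)  # a b^T X^T, first product-rule term (2072)
term2 = matmul(XT, abT)  # X^T a b^T, second product-rule term (2073)
closed_form = [[term1[i][j] + term2[i][j] for j in range(2)] for i in range(2)]
numeric = numeric_gradient(lambda Z: objective(Z, a, b), X)
max_err = max(abs(closed_form[i][j] - numeric[i][j]) for i in range(2) for j in range(2))
print(max_err)
```

Because $a^TX^2b$ is quadratic in the entries of $X$, the central difference carries no truncation error here, so the discrepancy is rounding noise only.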
D.1.2.1 Kronecker product

A partial remedy for venturing into hyperdimensional matrix representations, such as
the cubix or quartix, is to first vectorize matrices as in (39). This device gives rise
to the Kronecker product of matrices $\otimes$; a.k.a. tensor product (kron() in Matlab).
Although its definition sees reversal in the literature, [434, §2.1] Kronecker product is not
commutative ($B \otimes A \neq A \otimes B$). We adopt the definition: for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$
\[
B \otimes A \triangleq \begin{bmatrix}
B_{11} A & B_{12} A & \cdots & B_{1q} A \\
B_{21} A & B_{22} A & \cdots & B_{2q} A \\
\vdots & \vdots & & \vdots \\
B_{p1} A & B_{p2} A & \cdots & B_{pq} A
\end{bmatrix} \in \mathbb{R}^{pm \times qn} \tag{2075}
\]
for which $A \otimes 1 = 1 \otimes A = A$ (real unity acts like Identity).
One advantage to vectorization is existence of the traditional two-dimensional matrix
representation (second-order tensor) for the second-order gradient of a real function with
respect to a vectorized matrix. From §A.1.1 no.36 (§D.2.1) for square $A, B \in \mathbb{R}^{n \times n}$, for
example [220, §5.2] [15, §3]
\[
\nabla^2_{\operatorname{vec} X} \operatorname{tr}(A X B X^T) = \nabla^2_{\operatorname{vec} X} \operatorname{vec}(X)^T (B^T \otimes A) \operatorname{vec} X = B \otimes A^T + B^T \otimes A \in \mathbb{R}^{n^2 \times n^2} \tag{2076}
\]
A disadvantage is a large new but known set of algebraic rules (§A.1.1) and the fact
that its mere use does not generally guarantee two-dimensional matrix representation of
gradients.
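Another application of the Kronecker product is to reverse order of appearance in
a matrix product: Suppose we wish to weight the columns of a matrix $S \in \mathbb{R}^{M \times N}$, for
example, by respective entries $w_i$ from the main diagonal in
\[
W \triangleq \begin{bmatrix} w_1 & & 0 \\ & \ddots & \\ 0 & & w_N \end{bmatrix} \in \mathbb{S}^N \tag{2077}
\]
A conventional means for accomplishing column weighting is to multiply $S$ by diagonal
matrix $W$ on the right side:
\[
S\, W = S \begin{bmatrix} w_1 & & 0 \\ & \ddots & \\ 0 & & w_N \end{bmatrix} = \begin{bmatrix} S(:,1)\, w_1 & \cdots & S(:,N)\, w_N \end{bmatrix} \in \mathbb{R}^{M \times N} \tag{2078}
\]
To reverse product order such that diagonal matrix $W$ instead appears to the left of $S$:
for $I \in \mathbb{S}^M$ (Law)
\[
S\, W = \left( \delta(W)^T \otimes I \right) \begin{bmatrix}
S(:,1) & 0 & & 0 \\
0 & S(:,2) & \ddots & \\
 & \ddots & \ddots & 0 \\
0 & & 0 & S(:,N)
\end{bmatrix} \in \mathbb{R}^{M \times N} \tag{2079}
\]
To instead weight the rows of $S$ via diagonal matrix $W \in \mathbb{S}^M$, for $I \in \mathbb{S}^N$
\[
W S = \begin{bmatrix}
S(1,:) & 0 & & 0 \\
0 & S(2,:) & \ddots & \\
 & \ddots & \ddots & 0 \\
0 & & 0 & S(M,:)
\end{bmatrix} \left( \delta(W) \otimes I \right) \in \mathbb{R}^{M \times N} \tag{2080}
\]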
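A small numerical check of definition (2075) and of the column-weighting identity (2079) may help fix the block structure. This is a sketch under stated assumptions: `kron`, `matmul`, and `column_blockdiag` are hypothetical helpers written for this illustration, and the matrices are arbitrary.

```python
def kron(B, A):
    """Kronecker product B (x) A per (2075): block (i, j) equals B[i][j] * A."""
    return [[B[i][j] * A[r][c] for j in range(len(B[0])) for c in range(len(A[0]))]
            for i in range(len(B)) for r in range(len(A))]


def matmul(A, B):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*B)] for row in A]


def column_blockdiag(S):
    """Place the columns of S (M x N) on a block diagonal, giving an MN x N matrix."""
    M, N = len(S), len(S[0])
    D = [[0.0] * N for _ in range(M * N)]
    for j in range(N):
        for i in range(M):
            D[j * M + i][j] = S[i][j]
    return D


S = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # M = 2, N = 3
w = [10.0, 20.0, 30.0]                  # main diagonal of W
I = [[1.0, 0.0], [0.0, 1.0]]            # identity in S^M

lhs = [[S[i][j] * w[j] for j in range(3)] for i in range(2)]  # S W as in (2078)
rhs = matmul(kron([w], I), column_blockdiag(S))               # right side of (2079)

# Non-commutativity on a hypothetical pair of 2 x 2 matrices
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
commutes = kron(B, A) == kron(A, B)
print(lhs == rhs, commutes)
```

The argument order `kron(B, A)` mirrors $B \otimes A$ in (2075), where the entries of $B$ scale whole copies of $A$.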
D.1.2.2 Hadamard product

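For any matrices of like size, $S, Y \in \mathbb{R}^{M \times N}$, Hadamard's product $\circ$ denotes simple
multiplication of corresponding entries (.* in Matlab). It is possible to convert Hadamard
product into a standard product of matrices:
\[
S \circ Y = \begin{bmatrix} \delta(Y(:,1)) & \cdots & \delta(Y(:,N)) \end{bmatrix} \begin{bmatrix}
S(:,1) & 0 & & 0 \\
0 & S(:,2) & \ddots & \\
 & \ddots & \ddots & 0 \\
0 & & 0 & S(:,N)
\end{bmatrix} \in \mathbb{R}^{M \times N} \tag{2081}
\]
In the special case that $S = s$ and $Y = y$ are vectors in $\mathbb{R}^M$
\[
s \circ y = \delta(s)\, y \tag{2082}
\]
\[
\begin{aligned}
s^T \otimes y &= y s^T \\
s \otimes y^T &= s y^T
\end{aligned} \tag{2083}
\]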
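The vector identities (2082) and (2083) are easy to confirm on a small instance. The following sketch uses a hypothetical `kron` helper that follows definition (2075), with arbitrary illustrative vectors; none of the names come from the text.

```python
def kron(B, A):
    """Kronecker product B (x) A per (2075): block (i, j) equals B[i][j] * A."""
    return [[B[i][j] * A[r][c] for j in range(len(B[0])) for c in range(len(A[0]))]
            for i in range(len(B)) for r in range(len(A))]


s = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
row = lambda v: [list(v)]            # 1 x M matrix
col = lambda v: [[vi] for vi in v]   # M x 1 matrix

hadamard = [si * yi for si, yi in zip(s, y)]                                 # s o y
delta_s = [[s[i] if i == j else 0.0 for j in range(3)] for i in range(3)]    # diagonal matrix from s
delta_s_y = [sum(delta_s[i][j] * y[j] for j in range(3)) for i in range(3)]  # (2082)

y_sT = [[yi * sj for sj in s] for yi in y]  # outer product y s^T
s_yT = [[si * yj for yj in y] for si in s]  # outer product s y^T
print(hadamard == delta_s_y, kron(row(s), col(y)) == y_sT, kron(col(s), row(y)) == s_yT)
```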
D.1.3 Chain rules for composite matrix-functions

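Given dimensionally compatible matrix-valued functions of matrix variable $f(X)$ and
$g(X)$ [462, §15.7]
\[
\nabla_X\, g\!\left( f(X)^T \right) = \nabla_X f^T\, \nabla_f\, g \tag{2084}
\]
\[
\nabla^2_X\, g\!\left( f(X)^T \right) = \nabla_X \left( \nabla_X f^T\, \nabla_f\, g \right) = \nabla^2_X f\, \nabla_f\, g + \nabla_X f^T\, \nabla^2_f\, g\, \nabla_X f \tag{2085}
\]
D.1.3.1 Two arguments
\[
\nabla_X\, g\!\left( f(X)^T ,\, h(X)^T \right) = \nabla_X f^T\, \nabla_f\, g + \nabla_X h^T\, \nabla_h\, g \tag{2086}
\]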
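D.1.3.1.1 Example. Chain rule for two arguments. [51, §1.1]

\[
g\!\left( f(x)^T ,\, h(x)^T \right) = (f(x) + h(x))^T A\, (f(x) + h(x)) \tag{2087}
\]
\[
f(x) = \begin{bmatrix} x_1 \\ \varepsilon x_2 \end{bmatrix} , \qquad h(x) = \begin{bmatrix} \varepsilon x_1 \\ x_2 \end{bmatrix} \tag{2088}
\]
\[
\nabla_x\, g\!\left( f(x)^T ,\, h(x)^T \right) = \begin{bmatrix} 1 & 0 \\ 0 & \varepsilon \end{bmatrix} (A + A^T)(f + h) + \begin{bmatrix} \varepsilon & 0 \\ 0 & 1 \end{bmatrix} (A + A^T)(f + h) \tag{2089}
\]
\[
\nabla_x\, g\!\left( f(x)^T ,\, h(x)^T \right) = \begin{bmatrix} 1 + \varepsilon & 0 \\ 0 & 1 + \varepsilon \end{bmatrix} (A + A^T) \left( \begin{bmatrix} x_1 \\ \varepsilon x_2 \end{bmatrix} + \begin{bmatrix} \varepsilon x_1 \\ x_2 \end{bmatrix} \right) \tag{2090}
\]
\[
\lim_{\varepsilon \to 0} \nabla_x\, g\!\left( f(x)^T ,\, h(x)^T \right) = (A + A^T)\, x \tag{2091}
\]
from Table D.2.1. □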
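The two-argument chain rule in this example can be compared against a finite-difference gradient. The code below is a hedged sketch: the matrix $A$, the point $x$, the value of $\varepsilon$, and all function names are assumptions made for illustration.

```python
def g(x, A, eps):
    """g(f(x), h(x)) = (f + h)^T A (f + h), with f and h as in (2088)."""
    u = [x[0] + eps * x[0], eps * x[1] + x[1]]  # f(x) + h(x)
    return sum(u[i] * A[i][j] * u[j] for i in range(2) for j in range(2))


def gradient_2090(x, A, eps):
    """Right-hand side of (2090)."""
    u = [x[0] + eps * x[0], eps * x[1] + x[1]]
    v = [sum((A[i][j] + A[j][i]) * u[j] for j in range(2)) for i in range(2)]
    return [(1 + eps) * v[0], (1 + eps) * v[1]]


def numeric_gradient(x, A, eps, h=1e-6):
    """Central differences of g with respect to x."""
    out = []
    for i in range(2):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        out.append((g(xp, A, eps) - g(xm, A, eps)) / (2 * h))
    return out


# Hypothetical data
A = [[1.0, 2.0], [0.0, 3.0]]
x = [0.5, -1.5]
eps = 0.1

chain = gradient_2090(x, A, eps)
numeric = numeric_gradient(x, A, eps)
limit = gradient_2090(x, A, 0.0)  # value of (2090) at eps = 0, compare (2091)
table = [sum((A[i][j] + A[j][i]) * x[j] for j in range(2)) for i in range(2)]  # (A + A^T) x
print(chain, numeric, limit, table)
```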
These foregoing formulae remain correct when gradient produces hyperdimensional

representation:
D.1.4 First directional derivative

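Assume that a differentiable function $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}^{M \times N}$ has continuous first- and
second-order gradients $\nabla g$ and $\nabla^2 g$ over $\operatorname{dom} g$ which is an open set. We seek
simple expressions for the first and second directional derivatives in direction $Y \in \mathbb{R}^{K \times L}$:
respectively, $\overset{\to Y}{dg} \in \mathbb{R}^{M \times N}$ and $\overset{\to Y}{dg^2} \in \mathbb{R}^{M \times N}$.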
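Assuming that the limit exists, we may state the partial derivative of the $mn$th entry
of $g$ with respect to $kl$th entry of $X$;
\[
\frac{\partial g_{mn}(X)}{\partial X_{kl}} = \lim_{\Delta t \to 0} \frac{g_{mn}(X + \Delta t\, e_k e_l^T) - g_{mn}(X)}{\Delta t} \in \mathbb{R} \tag{2092}
\]
where $e_k$ is the $k$th standard basis vector in $\mathbb{R}^K$ while $e_l$ is the $l$th standard basis vector in
$\mathbb{R}^L$. Total number of partial derivatives equals $KLMN$ while the gradient is defined in
their terms; $mn$th entry of the gradient is
\[
\nabla g_{mn}(X) = \begin{bmatrix}
\dfrac{\partial g_{mn}(X)}{\partial X_{11}} & \dfrac{\partial g_{mn}(X)}{\partial X_{12}} & \cdots & \dfrac{\partial g_{mn}(X)}{\partial X_{1L}} \\[1ex]
\dfrac{\partial g_{mn}(X)}{\partial X_{21}} & \dfrac{\partial g_{mn}(X)}{\partial X_{22}} & \cdots & \dfrac{\partial g_{mn}(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial g_{mn}(X)}{\partial X_{K1}} & \dfrac{\partial g_{mn}(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K \times L} \tag{2093}
\]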
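while the gradient is a quartix
\[
\nabla g(X) = \begin{bmatrix}
\nabla g_{11}(X) & \nabla g_{12}(X) & \cdots & \nabla g_{1N}(X) \\
\nabla g_{21}(X) & \nabla g_{22}(X) & \cdots & \nabla g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla g_{M1}(X) & \nabla g_{M2}(X) & \cdots & \nabla g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M \times N \times K \times L} \tag{2094}
\]
By simply rotating our perspective of a four-dimensional representation of gradient matrix,
we find one of three useful transpositions of this quartix (connoted $T_1$):
\[
\nabla g(X)^{T_1} = \begin{bmatrix}
\dfrac{\partial g(X)}{\partial X_{11}} & \dfrac{\partial g(X)}{\partial X_{12}} & \cdots & \dfrac{\partial g(X)}{\partial X_{1L}} \\[1ex]
\dfrac{\partial g(X)}{\partial X_{21}} & \dfrac{\partial g(X)}{\partial X_{22}} & \cdots & \dfrac{\partial g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial g(X)}{\partial X_{K1}} & \dfrac{\partial g(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K \times L \times M \times N} \tag{2095}
\]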
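When a limit for $\Delta t \in \mathbb{R}$ exists, it is easy to show by substitution of variables in (2092)
\[
\frac{\partial g_{mn}(X)}{\partial X_{kl}}\, Y_{kl} = \lim_{\Delta t \to 0} \frac{g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T) - g_{mn}(X)}{\Delta t} \in \mathbb{R} \tag{2096}
\]
which may be interpreted as the change in $g_{mn}$ at $X$ when the change in $X_{kl}$ is equal
to $Y_{kl}$ the $kl$th entry of any $Y \in \mathbb{R}^{K \times L}$. Because the total change in $g_{mn}(X)$ due to $Y$ is
the sum of change with respect to each and every $X_{kl}$, the $mn$th entry of the directional
derivative is the corresponding total differential [462, §15.8]
\begin{align}
dg_{mn}(X)|_{dX \to Y} &= \sum_{k,l} \frac{\partial g_{mn}(X)}{\partial X_{kl}}\, Y_{kl} = \operatorname{tr}\left( \nabla g_{mn}(X)^T\, Y \right) \tag{2097} \\
&= \sum_{k,l} \lim_{\Delta t \to 0} \frac{g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T) - g_{mn}(X)}{\Delta t} \tag{2098} \\
&= \lim_{\Delta t \to 0} \frac{g_{mn}(X + \Delta t\, Y) - g_{mn}(X)}{\Delta t} \tag{2099} \\
&= \left. \frac{d}{dt} \right|_{t=0} g_{mn}(X + t\, Y) \tag{2100}
\end{align}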
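where $t \in \mathbb{R}$. Assuming finite $Y$, equation (2099) is called the Gâteaux differential
[50, App.A.5] [265, §D.2.1] [474, §5.28] whose existence is implied by existence of the
Fréchet differential (the sum in (2097)). [337, §7.2] Each may be understood as the change
in $g_{mn}$ at $X$ when the change in $X$ is equal in magnitude and direction to $Y$.D.2 Hence
the directional derivative,
\[
\begin{aligned}
\overset{\to Y}{dg}(X) &\triangleq \left. \begin{bmatrix}
dg_{11}(X) & dg_{12}(X) & \cdots & dg_{1N}(X) \\
dg_{21}(X) & dg_{22}(X) & \cdots & dg_{2N}(X) \\
\vdots & \vdots & & \vdots \\
dg_{M1}(X) & dg_{M2}(X) & \cdots & dg_{MN}(X)
\end{bmatrix} \right|_{dX \to Y} \in \mathbb{R}^{M \times N} \\
&= \begin{bmatrix}
\operatorname{tr}\left( \nabla g_{11}(X)^T Y \right) & \operatorname{tr}\left( \nabla g_{12}(X)^T Y \right) & \cdots & \operatorname{tr}\left( \nabla g_{1N}(X)^T Y \right) \\
\operatorname{tr}\left( \nabla g_{21}(X)^T Y \right) & \operatorname{tr}\left( \nabla g_{22}(X)^T Y \right) & \cdots & \operatorname{tr}\left( \nabla g_{2N}(X)^T Y \right) \\
\vdots & \vdots & & \vdots \\
\operatorname{tr}\left( \nabla g_{M1}(X)^T Y \right) & \operatorname{tr}\left( \nabla g_{M2}(X)^T Y \right) & \cdots & \operatorname{tr}\left( \nabla g_{MN}(X)^T Y \right)
\end{bmatrix} \\
&= \begin{bmatrix}
\sum\limits_{k,l} \frac{\partial g_{11}(X)}{\partial X_{kl}} Y_{kl} & \sum\limits_{k,l} \frac{\partial g_{12}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum\limits_{k,l} \frac{\partial g_{1N}(X)}{\partial X_{kl}} Y_{kl} \\
\sum\limits_{k,l} \frac{\partial g_{21}(X)}{\partial X_{kl}} Y_{kl} & \sum\limits_{k,l} \frac{\partial g_{22}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum\limits_{k,l} \frac{\partial g_{2N}(X)}{\partial X_{kl}} Y_{kl} \\
\vdots & \vdots & & \vdots \\
\sum\limits_{k,l} \frac{\partial g_{M1}(X)}{\partial X_{kl}} Y_{kl} & \sum\limits_{k,l} \frac{\partial g_{M2}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum\limits_{k,l} \frac{\partial g_{MN}(X)}{\partial X_{kl}} Y_{kl}
\end{bmatrix}
\end{aligned} \tag{2101}
\]
from which it follows
\[
\overset{\to Y}{dg}(X) = \sum_{k,l} \frac{\partial g(X)}{\partial X_{kl}}\, Y_{kl} \tag{2102}
\]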
D.2 Although $Y$ is a matrix, we may regard it as a vector in $\mathbb{R}^{KL}$.

Figure 216: Strictly convex quadratic bowl in $\mathbb{R}^2 \times \mathbb{R}$; $f(x) = x^T x : \mathbb{R}^2 \to \mathbb{R}$ versus $x$
on some open disc in $\mathbb{R}^2$. Plane slice $\partial\mathcal{H}$ is perpendicular to function domain. Slice
intersection with domain connotes bidirectional vector $y$. Slope of tangent line $T$ at
point $(\alpha , f(\alpha))$ is value of directional derivative $\nabla_x f(\alpha)^T y$ (2129) at $\alpha$ in slice direction $y$.
Negative gradient $-\nabla_x f(x) \in \mathbb{R}^2$ is direction of steepest descent. [74, §9.4.1] [462, §15.6]
[203] [519] When vector $\upsilon \in \mathbb{R}^3$ entry $\upsilon_3$ is half directional derivative in gradient direction
at $\alpha$ and when $\begin{bmatrix} \upsilon_1 \\ \upsilon_2 \end{bmatrix} = \nabla_x f(\alpha)$, then $-\upsilon$ points directly toward bowl bottom.
(Labels in the figure: curve $f(\alpha + t\, y)$, surface $f(x)$, tangent line $T$, slice $\partial\mathcal{H}$, point $(\alpha , f(\alpha))$, and vector
$\upsilon \triangleq \begin{bmatrix} \nabla_x f(\alpha) \\ \tfrac{1}{2} \overset{\to \nabla_x f(\alpha)}{df}(\alpha) \end{bmatrix}$.)

Yet for all $X \in \operatorname{dom} g$, any $Y \in \mathbb{R}^{K \times L}$, and some open interval of $t \in \mathbb{R}$
\[
g(X + t\, Y) = g(X) + t\, \overset{\to Y}{dg}(X) + O(t^2) \tag{2103}
\]
which is the first-order multidimensional Taylor series expansion about $X$. [462, §18.4]
[203, §2.3.4] Differentiation with respect to $t$ and subsequent $t$-zeroing isolates the second
term of expansion. Thus differentiating and zeroing $g(X + t\, Y)$ in $t$ is an operation
equivalent to individually differentiating and zeroing every entry $g_{mn}(X + t\, Y)$ as in
(2100). So the directional derivative of $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}^{M \times N}$ in any direction $Y \in \mathbb{R}^{K \times L}$
evaluated at $X \in \operatorname{dom} g$ becomes
\[
\overset{\to Y}{dg}(X) = \left. \frac{d}{dt} \right|_{t=0} g(X + t\, Y) \in \mathbb{R}^{M \times N} \tag{2104}
\]
[371, §2.1, §5.4.5] [43, §6.3.1] which is simplest. In case of a real function $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}$
\[
\overset{\to Y}{dg}(X) = \operatorname{tr}\left( \nabla g(X)^T\, Y \right) \tag{2126}
\]
In case $g(X) : \mathbb{R}^K \to \mathbb{R}$
\[
\overset{\to Y}{dg}(X) = \nabla g(X)^T\, Y \tag{2129}
\]
Unlike gradient, directional derivative does not expand dimension; directional
derivative (2104) retains the dimensions of $g$. The derivative with respect to $t$ makes
the directional derivative resemble ordinary calculus (§D.2); e.g., when $g(X)$ is linear,
$\overset{\to Y}{dg}(X) = g(Y)$. [337, §7.2]
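Expressions (2104) and (2126) give two routes to the same number for a real function of matrix argument. The sketch below compares them for $g(X) = \operatorname{tr}(X^TX)$, whose gradient $2X$ appears in the trace table of §D.2.3. The helper names, the matrices, and the step size are assumptions chosen for illustration.

```python
def trace_inner(A, B):
    """tr(A^T B), computed as the sum of entrywise products."""
    return sum(p * q for ra, rb in zip(A, B) for p, q in zip(ra, rb))


def g(X):
    """Real function g(X) = tr(X^T X)."""
    return trace_inner(X, X)


def along_line(X, Y, t):
    """The point X + t Y."""
    return [[x + t * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Hypothetical point X and direction Y in R^{2 x 3}
X = [[1.0, 2.0, 0.0], [-1.0, 0.5, 3.0]]
Y = [[0.5, -1.0, 2.0], [1.0, 1.0, -2.0]]

grad = [[2 * x for x in row] for row in X]  # gradient 2X from the table
dg_trace = trace_inner(grad, Y)             # (2126)

t = 1e-5
dg_line = (g(along_line(X, Y, t)) - g(along_line(X, Y, -t))) / (2 * t)  # difference form of (2104)
print(dg_trace, dg_line)
```

The directional derivative is a scalar here, matching the remark that it retains the dimensions of $g$ rather than expanding them as the gradient does.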
D.1.4.1 Interpretation of directional derivative

In the case of any differentiable real function g(X) : RK×L→ R , the directional derivative
of g(X) at X in any direction Y yields the slope of g along the line {X + t Y | t ∈ R}
through its domain evaluated at t = 0. For higher-dimensional functions, by (2101), this
slope interpretation can be applied to each entry of the directional derivative.
Figure 216, for example, shows a plane slice of a real convex bowl-shaped function
f (x) along a line {α + t y | t ∈ R} through its domain. The slice reveals a one-dimensional
real function of t ; f (α + t y). The directional derivative at x = α in direction y is the
slope of f (α + t y) with respect to t at t = 0. In the case of a real function having
vector argument h(X) : RK → R , its directional derivative in the normalized direction of
its gradient is the gradient magnitude. (2129) For a real function of real variable, the
directional derivative evaluated at any point in the function domain is just the slope of
that function there scaled by the real direction. (confer §3.6)
Directional derivative generalizes our one-dimensional notion of derivative to a
multidimensional domain. When direction $Y$ coincides with a member of the standard
Cartesian basis $e_k e_l^T$ (63), then a single partial derivative $\partial g(X) / \partial X_{kl}$ is obtained from
directional derivative (2102); such is each entry of gradient $\nabla g(X)$ in equalities (2126)
and (2129), for example.
D.1.4.1.1 Theorem. Directional derivative optimality condition. [337, §7.4]

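Suppose $f(X) : \mathbb{R}^{K \times L} \to \mathbb{R}$ is minimized on convex set $\mathcal{C} \subseteq \mathbb{R}^{K \times L}$ by $X^\star$, and the
directional derivative of $f$ exists there. Then for all $X \in \mathcal{C}$
\[
\overset{\to X - X^\star}{df}(X) \geq 0 \tag{2105}
\]
⋄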
D.1.4.1.2 Example. Simple bowl.

Bowl function (Figure 216)
\[
f(x) : \mathbb{R}^K \to \mathbb{R} \triangleq (x - a)^T (x - a) - b \tag{2106}
\]
has function offset $-b \in \mathbb{R}$, axis of revolution at $x = a$, and positive definite Hessian
(2054) everywhere in its domain (an open hyperdisc in $\mathbb{R}^K$); id est, strictly convex
quadratic $f(x)$ has unique global minimum equal to $-b$ at $x = a$. A vector $-\upsilon$ based
anywhere in $\operatorname{dom} f \times \mathbb{R}$ pointing toward the unique bowl-bottom is specified:
\[
\upsilon \propto \begin{bmatrix} x - a \\ f(x) + b \end{bmatrix} \in \mathbb{R}^K \times \mathbb{R} \tag{2107}
\]
Such a vector is
\[
\upsilon = \begin{bmatrix} \nabla_x f(x) \\[0.5ex] \tfrac{1}{2} \overset{\to \nabla_x f(x)}{df}(x) \end{bmatrix} \tag{2108}
\]
since the gradient is
\[
\nabla_x f(x) = 2(x - a) \tag{2109}
\]
and the directional derivative in direction of the gradient is (2129)
\[
\overset{\to \nabla_x f(x)}{df}(x) = \nabla_x f(x)^T\, \nabla_x f(x) = 4 (x - a)^T (x - a) = 4 (f(x) + b) \tag{2110}
\]
□
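A short numerical instance makes the proportionality between (2107) and (2108) concrete. In this sketch the values of $a$, $b$, and $x$ are hypothetical, chosen only so the arithmetic is easy to follow.

```python
a = [1.0, -2.0, 0.5]  # hypothetical axis of revolution
b = 3.0               # hypothetical offset


def f(x):
    """Bowl function (2106)."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a)) - b


x = [2.0, 0.0, -1.5]
grad = [2 * (xi - ai) for xi, ai in zip(x, a)]            # (2109)
dd = sum(gi * gi for gi in grad)                          # (2110)
upsilon = grad + [0.5 * dd]                               # (2108)
target = [xi - ai for xi, ai in zip(x, a)] + [f(x) + b]   # (2107)
ratios = [u / v for u, v in zip(upsilon, target)]
print(dd, 4 * (f(x) + b), ratios)
```

Every entry of `ratios` comes out the same, so $\upsilon$ from (2108) is a positive multiple of the vector in (2107).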
D.1.5 Second directional derivative

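By similar argument, it so happens: the second directional derivative is equally simple.
Given $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}^{M \times N}$ on open domain,
\[
\nabla \frac{\partial g_{mn}(X)}{\partial X_{kl}} = \frac{\partial \nabla g_{mn}(X)}{\partial X_{kl}} = \begin{bmatrix}
\dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{11}} & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{12}} & \cdots & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{1L}} \\[1ex]
\dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{21}} & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{22}} & \cdots & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{K1}} & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{K2}} & \cdots & \dfrac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K \times L} \tag{2111}
\]
\[
\begin{aligned}
\nabla^2 g_{mn}(X) &= \begin{bmatrix}
\nabla \dfrac{\partial g_{mn}(X)}{\partial X_{11}} & \nabla \dfrac{\partial g_{mn}(X)}{\partial X_{12}} & \cdots & \nabla \dfrac{\partial g_{mn}(X)}{\partial X_{1L}} \\[1ex]
\nabla \dfrac{\partial g_{mn}(X)}{\partial X_{21}} & \nabla \dfrac{\partial g_{mn}(X)}{\partial X_{22}} & \cdots & \nabla \dfrac{\partial g_{mn}(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\nabla \dfrac{\partial g_{mn}(X)}{\partial X_{K1}} & \nabla \dfrac{\partial g_{mn}(X)}{\partial X_{K2}} & \cdots & \nabla \dfrac{\partial g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K \times L \times K \times L} \\
&= \begin{bmatrix}
\dfrac{\partial \nabla g_{mn}(X)}{\partial X_{11}} & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{12}} & \cdots & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{1L}} \\[1ex]
\dfrac{\partial \nabla g_{mn}(X)}{\partial X_{21}} & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{22}} & \cdots & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial \nabla g_{mn}(X)}{\partial X_{K1}} & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial \nabla g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix}
\end{aligned} \tag{2112}
\]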
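Rotating our perspective, we get several views of the second-order gradient:
\[
\nabla^2 g(X) = \begin{bmatrix}
\nabla^2 g_{11}(X) & \nabla^2 g_{12}(X) & \cdots & \nabla^2 g_{1N}(X) \\
\nabla^2 g_{21}(X) & \nabla^2 g_{22}(X) & \cdots & \nabla^2 g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
\nabla^2 g_{M1}(X) & \nabla^2 g_{M2}(X) & \cdots & \nabla^2 g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M \times N \times K \times L \times K \times L} \tag{2113}
\]
\[
\nabla^2 g(X)^{T_1} = \begin{bmatrix}
\nabla \dfrac{\partial g(X)}{\partial X_{11}} & \nabla \dfrac{\partial g(X)}{\partial X_{12}} & \cdots & \nabla \dfrac{\partial g(X)}{\partial X_{1L}} \\[1ex]
\nabla \dfrac{\partial g(X)}{\partial X_{21}} & \nabla \dfrac{\partial g(X)}{\partial X_{22}} & \cdots & \nabla \dfrac{\partial g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\nabla \dfrac{\partial g(X)}{\partial X_{K1}} & \nabla \dfrac{\partial g(X)}{\partial X_{K2}} & \cdots & \nabla \dfrac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K \times L \times M \times N \times K \times L} \tag{2114}
\]
\[
\nabla^2 g(X)^{T_2} = \begin{bmatrix}
\dfrac{\partial \nabla g(X)}{\partial X_{11}} & \dfrac{\partial \nabla g(X)}{\partial X_{12}} & \cdots & \dfrac{\partial \nabla g(X)}{\partial X_{1L}} \\[1ex]
\dfrac{\partial \nabla g(X)}{\partial X_{21}} & \dfrac{\partial \nabla g(X)}{\partial X_{22}} & \cdots & \dfrac{\partial \nabla g(X)}{\partial X_{2L}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial \nabla g(X)}{\partial X_{K1}} & \dfrac{\partial \nabla g(X)}{\partial X_{K2}} & \cdots & \dfrac{\partial \nabla g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K \times L \times K \times L \times M \times N} \tag{2115}
\]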
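Assuming the limits to exist, we may state the partial derivative of the $mn$th entry of $g$
with respect to $kl$th and $ij$th entries of $X$;
\[
\begin{aligned}
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{ij}} &= \frac{\partial \left( \frac{\partial g_{mn}(X)}{\partial X_{kl}} \right)}{\partial X_{ij}} = \lim_{\Delta t \to 0} \frac{\partial g_{mn}(X + \Delta t\, e_k e_l^T) - \partial g_{mn}(X)}{\partial X_{ij}\, \Delta t} \\
&= \lim_{\Delta\tau , \Delta t \to 0} \frac{\left( g_{mn}(X + \Delta t\, e_k e_l^T + \Delta\tau\, e_i e_j^T) - g_{mn}(X + \Delta t\, e_k e_l^T) \right) - \left( g_{mn}(X + \Delta\tau\, e_i e_j^T) - g_{mn}(X) \right)}{\Delta\tau\, \Delta t}
\end{aligned} \tag{2116}
\]
Differentiating (2096) and then scaling by $Y_{ij}$
\[
\begin{aligned}
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{ij}}\, Y_{kl} Y_{ij} &= \lim_{\Delta t \to 0} \frac{\partial g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T) - \partial g_{mn}(X)}{\partial X_{ij}\, \Delta t}\, Y_{ij} \\
&= \lim_{\Delta\tau , \Delta t \to 0} \frac{\left( g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T + \Delta\tau\, Y_{ij}\, e_i e_j^T) - g_{mn}(X + \Delta t\, Y_{kl}\, e_k e_l^T) \right) - \left( g_{mn}(X + \Delta\tau\, Y_{ij}\, e_i e_j^T) - g_{mn}(X) \right)}{\Delta\tau\, \Delta t}
\end{aligned} \tag{2117}
\]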
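which can be proved by substitution of variables in (2116). The $mn$th second-order total
differential due to any $Y \in \mathbb{R}^{K \times L}$ is
\begin{align}
d^2 g_{mn}(X)|_{dX \to Y} &= \sum_{i,j} \sum_{k,l} \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\, \partial X_{ij}}\, Y_{kl} Y_{ij} = \operatorname{tr}\left( \nabla_X \operatorname{tr}\left( \nabla g_{mn}(X)^T\, Y \right)^T Y \right) \tag{2118} \\
&= \sum_{i,j} \lim_{\Delta t \to 0} \frac{\partial g_{mn}(X + \Delta t\, Y) - \partial g_{mn}(X)}{\partial X_{ij}\, \Delta t}\, Y_{ij} \tag{2119} \\
&= \lim_{\Delta t \to 0} \frac{g_{mn}(X + 2 \Delta t\, Y) - 2 g_{mn}(X + \Delta t\, Y) + g_{mn}(X)}{\Delta t^2} \tag{2120} \\
&= \left. \frac{d^2}{dt^2} \right|_{t=0} g_{mn}(X + t\, Y) \tag{2121}
\end{align}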
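Hence the second directional derivative,
\[
\begin{aligned}
\overset{\to Y}{dg^2}(X) &\triangleq \left. \begin{bmatrix}
d^2 g_{11}(X) & d^2 g_{12}(X) & \cdots & d^2 g_{1N}(X) \\
d^2 g_{21}(X) & d^2 g_{22}(X) & \cdots & d^2 g_{2N}(X) \\
\vdots & \vdots & & \vdots \\
d^2 g_{M1}(X) & d^2 g_{M2}(X) & \cdots & d^2 g_{MN}(X)
\end{bmatrix} \right|_{dX \to Y} \in \mathbb{R}^{M \times N} \\
&= \begin{bmatrix}
\operatorname{tr}\left( \nabla \operatorname{tr}\left( \nabla g_{11}(X)^T Y \right)^T Y \right) & \cdots & \operatorname{tr}\left( \nabla \operatorname{tr}\left( \nabla g_{1N}(X)^T Y \right)^T Y \right) \\
\operatorname{tr}\left( \nabla \operatorname{tr}\left( \nabla g_{21}(X)^T Y \right)^T Y \right) & \cdots & \operatorname{tr}\left( \nabla \operatorname{tr}\left( \nabla g_{2N}(X)^T Y \right)^T Y \right) \\
\vdots & & \vdots \\
\operatorname{tr}\left( \nabla \operatorname{tr}\left( \nabla g_{M1}(X)^T Y \right)^T Y \right) & \cdots & \operatorname{tr}\left( \nabla \operatorname{tr}\left( \nabla g_{MN}(X)^T Y \right)^T Y \right)
\end{bmatrix} \\
&= \begin{bmatrix}
\sum\limits_{i,j} \sum\limits_{k,l} \frac{\partial^2 g_{11}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} & \cdots & \sum\limits_{i,j} \sum\limits_{k,l} \frac{\partial^2 g_{1N}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} \\
\sum\limits_{i,j} \sum\limits_{k,l} \frac{\partial^2 g_{21}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} & \cdots & \sum\limits_{i,j} \sum\limits_{k,l} \frac{\partial^2 g_{2N}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} \\
\vdots & & \vdots \\
\sum\limits_{i,j} \sum\limits_{k,l} \frac{\partial^2 g_{M1}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij} & \cdots & \sum\limits_{i,j} \sum\limits_{k,l} \frac{\partial^2 g_{MN}(X)}{\partial X_{kl}\, \partial X_{ij}} Y_{kl} Y_{ij}
\end{bmatrix}
\end{aligned} \tag{2122}
\]
(the middle column of each array, which follows the same pattern, is omitted for width) from which it follows
\[
\overset{\to Y}{dg^2}(X) = \sum_{i,j} \sum_{k,l} \frac{\partial^2 g(X)}{\partial X_{kl}\, \partial X_{ij}}\, Y_{kl} Y_{ij} = \sum_{i,j} \frac{\partial}{\partial X_{ij}}\, \overset{\to Y}{dg}(X)\, Y_{ij} \tag{2123}
\]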
Yet for all $X \in \operatorname{dom} g$, any $Y \in \mathbb{R}^{K \times L}$, and some open interval of $t \in \mathbb{R}$
\[
g(X + t\, Y) = g(X) + t\, \overset{\to Y}{dg}(X) + \frac{1}{2!}\, t^2\, \overset{\to Y}{dg^2}(X) + O(t^3) \tag{2124}
\]
which is the second-order multidimensional Taylor series expansion about $X$. [462, §18.4]
[203, §2.3.4] Differentiating twice with respect to $t$ and subsequent $t$-zeroing isolates the
third term of the expansion. Thus differentiating and zeroing $g(X + t\, Y)$ in $t$ is an
operation equivalent to individually differentiating and zeroing every entry $g_{mn}(X + t\, Y)$
as in (2121). So the second directional derivative of $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}^{M \times N}$ becomes
[371, §2.1, §5.4.5] [43, §6.3.1]
\[
\overset{\to Y}{dg^2}(X) = \left. \frac{d^2}{dt^2} \right|_{t=0} g(X + t\, Y) \in \mathbb{R}^{M \times N} \tag{2125}
\]
which is again simplest. (confer (2104)) Directional derivative retains the dimensions of $g$.
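The difference quotient in (2120) can be evaluated directly and compared with the quadratic form $Y^T \nabla^2 g(X)\, Y$, which is the second directional derivative of a real function of vector argument (the special case appearing as (2130)). The sketch below rests on assumptions: the test function, the point, the direction, and the step size are all illustrative choices.

```python
def g(x):
    """Hypothetical smooth test function g(x) = x1^2 x2 + x2^3."""
    return x[0] ** 2 * x[1] + x[1] ** 3


def hessian(x):
    """Analytic Hessian of the test function above."""
    return [[2 * x[1], 2 * x[0]], [2 * x[0], 6 * x[1]]]


def second_difference(func, x, y, dt=1e-4):
    """Difference quotient appearing in (2120), before taking the limit."""
    p1 = [xi + dt * yi for xi, yi in zip(x, y)]
    p2 = [xi + 2 * dt * yi for xi, yi in zip(x, y)]
    return (func(p2) - 2 * func(p1) + func(x)) / dt ** 2


x = [1.0, 2.0]
y = [0.5, -1.0]
H = hessian(x)
quadratic_form = sum(y[i] * H[i][j] * y[j] for i in range(2) for j in range(2))
estimate = second_difference(g, x, y)
print(quadratic_form, estimate)
```

The quotient in (2120) is one-sided, so for finite `dt` it agrees with the limit only to first order in `dt`. A centered variant would converge faster but would no longer be the expression written in the text.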
D.1.6 directional derivative expressions

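In the case of a real function $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}$, all its directional derivatives are in $\mathbb{R}$:
\[
\overset{\to Y}{dg}(X) = \operatorname{tr}\left( \nabla g(X)^T\, Y \right) \tag{2126}
\]
\[
\overset{\to Y}{dg^2}(X) = \operatorname{tr}\left( \nabla_X \operatorname{tr}\left( \nabla g(X)^T\, Y \right)^T Y \right) = \operatorname{tr}\left( \nabla_X\, \overset{\to Y}{dg}(X)^T\, Y \right) \tag{2127}
\]
\[
\overset{\to Y}{dg^3}(X) = \operatorname{tr}\left( \nabla_X \operatorname{tr}\left( \nabla_X \operatorname{tr}\left( \nabla g(X)^T\, Y \right)^T Y \right)^T Y \right) = \operatorname{tr}\left( \nabla_X\, \overset{\to Y}{dg^2}(X)^T\, Y \right) \tag{2128}
\]
In the case $g(X) : \mathbb{R}^K \to \mathbb{R}$ has vector argument, they further simplify:
\[
\overset{\to Y}{dg}(X) = \nabla g(X)^T\, Y \tag{2129}
\]
\[
\overset{\to Y}{dg^2}(X) = Y^T\, \nabla^2 g(X)\, Y \tag{2130}
\]
\[
\overset{\to Y}{dg^3}(X) = \nabla_X \left( Y^T\, \nabla^2 g(X)\, Y \right)^T Y \tag{2131}
\]
and so on.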
D.1.7 higher-order multidimensional Taylor series

Series expansions of the differentiable matrix-valued function $g(X)$, of matrix argument,
were given earlier in (2103) and (2124). Assume that $g(X)$ has continuous first-, second-,
and third-order gradients over open set $\operatorname{dom} g$. Then, for $X \in \operatorname{dom} g$ and any $Y \in \mathbb{R}^{K \times L}$,
the Taylor series is expressed on some open interval of $\mu \in \mathbb{R}$
\[
g(X + \mu Y) = g(X) + \mu\, \overset{\to Y}{dg}(X) + \frac{1}{2!}\, \mu^2\, \overset{\to Y}{dg^2}(X) + \frac{1}{3!}\, \mu^3\, \overset{\to Y}{dg^3}(X) + O(\mu^4) \tag{2132}
\]
or on some open interval of $\|Y\|_2$
\[
g(Y) = g(X) + \overset{\to Y - X}{dg}(X) + \frac{1}{2!}\, \overset{\to Y - X}{dg^2}(X) + \frac{1}{3!}\, \overset{\to Y - X}{dg^3}(X) + O(\|Y\|^4) \tag{2133}
\]
which are third-order expansions about $X$. The mean value theorem from calculus is what
ensures finite order of the series. [462] [51, §1.1] [50, App.A.5] [265, §0.4] These somewhat
unbelievable formulaeD.3 imply that a function can be determined over the whole of its
domain by knowing its value and all its directional derivatives at a single point $X$.
D.1.7.0.1 Example. Inverse-matrix function.

Say $g(Y) = Y^{-1}$. From the table on page 616,
\[
\overset{\to Y}{dg}(X) = \left. \frac{d}{dt} \right|_{t=0} g(X + t\, Y) = -X^{-1}\, Y X^{-1} \tag{2134}
\]
\[
\overset{\to Y}{dg^2}(X) = \left. \frac{d^2}{dt^2} \right|_{t=0} g(X + t\, Y) = 2 X^{-1}\, Y X^{-1}\, Y X^{-1} \tag{2135}
\]
\[
\overset{\to Y}{dg^3}(X) = \left. \frac{d^3}{dt^3} \right|_{t=0} g(X + t\, Y) = -6 X^{-1}\, Y X^{-1}\, Y X^{-1}\, Y X^{-1} \tag{2136}
\]
Let's find the Taylor series expansion of $g$ about $X = I$: Since $g(I) = I$, for $\|Y\|_2 < 1$
($\mu = 1$ in (2132))
\[
g(I + Y) = (I + Y)^{-1} = I - Y + Y^2 - Y^3 + \ldots \tag{2137}
\]
If $Y$ is small, $(I + Y)^{-1} \approx I - Y$.D.4 Now we find Taylor series expansion about $X$; the
factor $\frac{1}{2!}$ in (2132) cancels the 2 in (2135), so
\[
g(X + Y) = (X + Y)^{-1} = X^{-1} - X^{-1}\, Y X^{-1} + X^{-1}\, Y X^{-1}\, Y X^{-1} - \ldots \tag{2138}
\]
If $Y$ is small, $(X + Y)^{-1} \approx X^{-1} - X^{-1}\, Y X^{-1}$. □

D.3 e.g., real continuous and differentiable function of real variable $f(x) = e^{-1/x^2}$ has no Taylor series
expansion about $x = 0$, of any practical use, because each derivative equals 0 there.
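The truncated series for the inverse can be compared against an exact inverse on a small instance. The following sketch uses $2 \times 2$ matrices so the inverse has a closed form. The helper names and the numerical values of $X$ and $Y$ are assumptions for illustration. The second-order term carries coefficient 1, being $\frac{1}{2!}$ times (2135).

```python
def matmul(A, B):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*B)] for row in A]


def inv2(A):
    """Closed-form inverse of a 2 x 2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]


def combine(*terms):
    """Entrywise sum of (sign, matrix) pairs."""
    return [[sum(s * M[i][j] for s, M in terms) for j in range(2)] for i in range(2)]


def max_abs_diff(A, B):
    return max(abs(A[i][j] - B[i][j]) for i in range(2) for j in range(2))


# Hypothetical data with Y small relative to X
X = [[2.0, 1.0], [1.0, 3.0]]
Y = [[0.01, 0.02], [-0.01, 0.03]]

Xi = inv2(X)
exact = inv2(combine((1, X), (1, Y)))
first = matmul(matmul(Xi, Y), Xi)       # X^-1 Y X^-1
second = matmul(matmul(Xi, Y), first)   # X^-1 Y X^-1 Y X^-1

err0 = max_abs_diff(exact, Xi)                                           # zeroth order
err1 = max_abs_diff(exact, combine((1, Xi), (-1, first)))                # first order
err2 = max_abs_diff(exact, combine((1, Xi), (-1, first), (1, second)))   # second order
print(err0, err1, err2)
```

Each additional term should reduce the error by roughly a factor of the size of $X^{-1}Y$, which is what the three printed values let one inspect.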
D.1.7.0.2 Exercise. log det . (confer [74, p.644])

Find the first three terms of a Taylor series expansion for log det Y . Specify an open
interval over which the expansion holds in vicinity of X . H
D.1.8 Correspondence of gradient to derivative

From the foregoing expressions for directional derivative, we derive a relationship between
gradient with respect to matrix X and derivative with respect to real variable t :
D.1.8.1 first-order
Removing evaluation at $t = 0$ from (2104),D.5 we find an expression for the directional
derivative of $g(X)$ in direction $Y$ evaluated anywhere along a line $\{X + t\, Y \mid t \in \mathbb{R}\}$
intersecting $\operatorname{dom} g$
\[
\overset{\to Y}{dg}(X + t\, Y) = \frac{d}{dt}\, g(X + t\, Y) \tag{2139}
\]
In the general case $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}^{M \times N}$, from (2097) and (2100) we find
\[
\operatorname{tr}\left( \nabla_X\, g_{mn}(X + t\, Y)^T\, Y \right) = \frac{d}{dt}\, g_{mn}(X + t\, Y) \tag{2140}
\]
which is valid at $t = 0$, of course, when $X \in \operatorname{dom} g$. In the important case of a real
function $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}$, from (2126) we have simply
\[
\operatorname{tr}\left( \nabla_X\, g(X + t\, Y)^T\, Y \right) = \frac{d}{dt}\, g(X + t\, Y) \tag{2141}
\]
When $g(X) : \mathbb{R}^K \to \mathbb{R}$ has vector argument,
\[
\nabla_X\, g(X + t\, Y)^T\, Y = \frac{d}{dt}\, g(X + t\, Y) \tag{2142}
\]

D.4 Had we instead set $g(Y) = (I + Y)^{-1}$, then the equivalent expansion would have been about $X = 0$.
D.5 Justified by replacing $X$ with $X + t\, Y$ in (2097)-(2099); beginning,
\[
dg_{mn}(X + t\, Y)|_{dX \to Y} = \sum_{k,l} \frac{\partial g_{mn}(X + t\, Y)}{\partial X_{kl}}\, Y_{kl}
\]
D.1.8.1.1 Example. Gradient.

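$g(X) = w^T X^T X w$, $X \in \mathbb{R}^{K \times L}$, $w \in \mathbb{R}^L$. Using the tables in §D.2,
\begin{align}
\operatorname{tr}\left( \nabla_X\, g(X + t\, Y)^T\, Y \right) &= \operatorname{tr}\left( 2 w w^T (X^T + t\, Y^T)\, Y \right) \tag{2143} \\
&= 2 w^T (X^T Y + t\, Y^T Y)\, w \tag{2144}
\end{align}
Applying equivalence (2141),
\begin{align}
\frac{d}{dt}\, g(X + t\, Y) &= \frac{d}{dt}\, w^T (X + t\, Y)^T (X + t\, Y)\, w \tag{2145} \\
&= w^T \left( X^T Y + Y^T X + 2 t\, Y^T Y \right) w \tag{2146} \\
&= 2 w^T (X^T Y + t\, Y^T Y)\, w \tag{2147}
\end{align}
which is the same as (2144). Hence, the equivalence is demonstrated.

It is easy to extract $\nabla g(X)$ from (2147) knowing only (2141):
\[
\begin{aligned}
\operatorname{tr}\left( \nabla_X\, g(X + t\, Y)^T\, Y \right) &= 2 w^T (X^T Y + t\, Y^T Y)\, w \\
&= 2 \operatorname{tr}\left( w w^T (X^T + t\, Y^T)\, Y \right) \\
\operatorname{tr}\left( \nabla_X\, g(X)^T\, Y \right) &= 2 \operatorname{tr}\left( w w^T X^T Y \right) \\
&\Leftrightarrow \\
\nabla_X\, g(X) &= 2 X w w^T
\end{aligned} \tag{2148}
\]
□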
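Equivalence (2141) holds along the whole line, not only at $t = 0$, and that can be observed numerically. This sketch evaluates both sides at a nonzero $t$ for $g(X) = w^TX^TXw$. The matrices, the vector $w$, the value of $t$, and the helper names are assumptions made for illustration.

```python
def times_vec(X, w):
    """Matrix-vector product X w."""
    return [sum(xij * wj for xij, wj in zip(row, w)) for row in X]


def g(X, w):
    """g(X) = w^T X^T X w, the squared length of X w."""
    return sum(v * v for v in times_vec(X, w))


def gradient(X, w):
    """Closed form 2 X w w^T from (2148)."""
    Xw = times_vec(X, w)
    return [[2 * Xw[i] * wj for wj in w] for i in range(len(X))]


def line(X, Y, t):
    """The point X + t Y."""
    return [[x + t * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Hypothetical data with K = 2, L = 3
X = [[1.0, 0.0, 2.0], [-1.0, 3.0, 0.5]]
Y = [[0.5, 1.0, -1.0], [2.0, 0.0, 1.0]]
w = [1.0, -2.0, 0.5]
t, h = 0.3, 1e-5

G = gradient(line(X, Y, t), w)
lhs = sum(G[i][j] * Y[i][j] for i in range(2) for j in range(3))         # tr(grad^T Y) at X + tY
rhs = (g(line(X, Y, t + h), w) - g(line(X, Y, t - h), w)) / (2 * h)       # d/dt g(X + tY)
print(lhs, rhs)
```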
D.1.8.2 second-order
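Likewise removing the evaluation at $t = 0$ from (2125),
\[
\overset{\to Y}{dg^2}(X + t\, Y) = \frac{d^2}{dt^2}\, g(X + t\, Y) \tag{2149}
\]
we can find a similar relationship between second-order gradient and second derivative: In
the general case $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}^{M \times N}$ from (2118) and (2121),
\[
\operatorname{tr}\left( \nabla_X \operatorname{tr}\left( \nabla_X\, g_{mn}(X + t\, Y)^T\, Y \right)^T Y \right) = \frac{d^2}{dt^2}\, g_{mn}(X + t\, Y) \tag{2150}
\]
In the case of a real function $g(X) : \mathbb{R}^{K \times L} \to \mathbb{R}$ we have, of course,
\[
\operatorname{tr}\left( \nabla_X \operatorname{tr}\left( \nabla_X\, g(X + t\, Y)^T\, Y \right)^T Y \right) = \frac{d^2}{dt^2}\, g(X + t\, Y) \tag{2151}
\]
From (2130), the simpler case, where real function $g(X) : \mathbb{R}^K \to \mathbb{R}$ has vector argument,
\[
Y^T\, \nabla^2_X\, g(X + t\, Y)\, Y = \frac{d^2}{dt^2}\, g(X + t\, Y) \tag{2152}
\]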
D.1.8.2.1 Example. Second-order gradient.

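We want to find $\nabla^2 g(X) \in \mathbb{R}^{K \times K \times K \times K}$ given real function $g(X) = \log \det X$ having
domain $\operatorname{intr} \mathbb{S}^K_+$. From the tables in §D.2,
\[
h(X) \triangleq \nabla g(X) = X^{-1} \in \operatorname{intr} \mathbb{S}^K_+ \tag{2153}
\]
so $\nabla^2 g(X) = \nabla h(X)$. By (2140) and (2103), for $Y \in \mathbb{S}^K$
\begin{align}
\operatorname{tr}\left( \nabla h_{mn}(X)^T\, Y \right) &= \left. \frac{d}{dt} \right|_{t=0} h_{mn}(X + t\, Y) \tag{2154} \\
&= \left( \left. \frac{d}{dt} \right|_{t=0} h(X + t\, Y) \right)_{mn} \tag{2155} \\
&= \left( \left. \frac{d}{dt} \right|_{t=0} (X + t\, Y)^{-1} \right)_{mn} \tag{2156} \\
&= -\left( X^{-1}\, Y X^{-1} \right)_{mn} \tag{2157}
\end{align}
Setting $Y$ to a member of $\{ e_k e_l^T \in \mathbb{R}^{K \times K} \mid k , l = 1 \ldots K \}$, and employing a property (41)
of the trace function we find
\[
\nabla^2 g(X)_{mnkl} = \operatorname{tr}\left( \nabla h_{mn}(X)^T\, e_k e_l^T \right) = \nabla h_{mn}(X)_{kl} = -\left( X^{-1} e_k e_l^T X^{-1} \right)_{mn} \tag{2158}
\]
\[
\nabla^2 g(X)_{kl} = \nabla h(X)_{kl} = -\left( X^{-1} e_k e_l^T X^{-1} \right) \in \mathbb{R}^{K \times K} \tag{2159}
\]
□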
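Formula (2159) can be compared with finite differences of $X^{-1}$ taken one entry at a time, treating all entries of $X$ as independent (symmetry ignored). The sketch below does this for $K = 2$. The helper names, the matrix $X$, and the step size are assumptions for illustration.

```python
def matmul(A, B):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*B)] for row in A]


def inv2(A):
    """Closed-form inverse of a 2 x 2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]


def numeric_partial(X, k, l, h=1e-6):
    """Central difference of X^-1 with respect to the single entry X_kl."""
    Xp = [row[:] for row in X]
    Xm = [row[:] for row in X]
    Xp[k][l] += h
    Xm[k][l] -= h
    Ip, Im = inv2(Xp), inv2(Xm)
    return [[(Ip[m][n] - Im[m][n]) / (2 * h) for n in range(2)] for m in range(2)]


def formula_2159(X, k, l):
    """-X^-1 e_k e_l^T X^-1 as in (2159)."""
    Xi = inv2(X)
    E = [[1.0 if (i, j) == (k, l) else 0.0 for j in range(2)] for i in range(2)]
    return [[-v for v in row] for row in matmul(matmul(Xi, E), Xi)]


X = [[2.0, 1.0], [1.0, 3.0]]  # hypothetical positive definite matrix
max_err = max(
    abs(numeric_partial(X, k, l)[m][n] - formula_2159(X, k, l)[m][n])
    for k in range(2) for l in range(2) for m in range(2) for n in range(2)
)
print(max_err)
```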
From all these first- and second-order expressions, we may generate new ones by evaluating
both sides at arbitrary t (in some open interval) but only after differentiation.
D.2 Tables of gradients and derivatives

Results may be validated numerically via Richardson extrapolation. [332, §5.4] [146]
When algebraically proving results for symmetric matrices, it is critical to take
gradients ignoring symmetry and to then substitute symmetric entries afterward.
[220] [78]
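$i, j, k, \ell, K, L, m, n, M, N$ are integers, unless otherwise noted, $a, b \in \mathbb{R}^n$, $x, y \in \mathbb{R}^k$,
$A, B \in \mathbb{R}^{m \times n}$, $X, Y \in \mathbb{R}^{K \times L}$, $t, \mu \in \mathbb{R}$.
$x^\mu$ means $\delta(\delta(x)^\mu)$ for $\mu \in \mathbb{R}$; id est, entrywise vector exponentiation. $\delta$ is the
main-diagonal linear operator (1681). $x^0 \triangleq \mathbf{1}$, $X^0 \triangleq I$ if square.
$\dfrac{d}{dx} \triangleq \begin{bmatrix} \frac{d}{dx_1} \\ \vdots \\ \frac{d}{dx_k} \end{bmatrix}$, $\overset{\to y}{dg}(x)$, $\overset{\to y}{dg^2}(x)$ (directional derivatives §D.1), $\log x$, $e^x$, $|x|$, $x/y$
(Hadamard quotient), $\operatorname{sgn} x$, $\sqrt[\circ]{x}$ (entrywise square root), etcetera, are maps
$f : \mathbb{R}^k \to \mathbb{R}^k$ that maintain dimension; e.g., (§A.1.1)
\[
\frac{d}{dx}\, x^{-1} \triangleq \nabla_x\, \mathbf{1}^T \delta(x)^{-1} \mathbf{1} \tag{2160}
\]
For $A$ a scalar or square matrix, we have the Taylor series [98, §3.6]
\[
e^A \triangleq \sum_{k=0}^{\infty} \frac{1}{k!}\, A^k \tag{2161}
\]
Further, [440, §5.4]
\[
e^A \succ 0 \quad \forall\, A \in \mathbb{S}^m \tag{2162}
\]
For all square $A$ and integer $k$
\[
\det{}^k A = \det A^k \tag{2163}
\]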
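The numerical validation by Richardson extrapolation mentioned at the start of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited references: it applies a single extrapolation step to a central difference, and checks the table entry $\frac{d}{dt} \log\det(X + t\,Y) = \operatorname{tr}((X + t\,Y)^{-1} Y)$ from §D.2.4 for hypothetical $2 \times 2$ matrices. All helper names are invented for this example.

```python
import math


def central(phi, t, h):
    """Central-difference estimate of phi'(t); truncation error is O(h^2)."""
    return (phi(t + h) - phi(t - h)) / (2 * h)


def richardson(phi, t, h):
    """One Richardson step combining step sizes h and h/2; truncation error is O(h^4)."""
    return (4 * central(phi, t, h / 2) - central(phi, t, h)) / 3


# Hypothetical data; det(X + tY) stays positive near t0
X = [[2.0, 1.0], [1.0, 3.0]]
Y = [[0.5, -1.0], [2.0, 1.0]]


def shifted(t):
    """The matrix X + t Y."""
    return [[X[i][j] + t * Y[i][j] for j in range(2)] for i in range(2)]


def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


def phi(t):
    """phi(t) = log det(X + tY)."""
    return math.log(det2(shifted(t)))


t0, h = 0.1, 0.1
A = shifted(t0)
# tr((X + tY)^-1 Y), written out with the 2 x 2 adjugate of A
table_value = (A[1][1] * Y[0][0] - A[0][1] * Y[1][0]
               - A[1][0] * Y[0][1] + A[0][0] * Y[1][1]) / det2(A)

err_central = abs(central(phi, t0, h) - table_value)
err_richardson = abs(richardson(phi, t0, h) - table_value)
print(err_central, err_richardson)
```

A deliberately coarse step `h` is used so that the gain from extrapolation is visible above rounding error. The same pattern applies to any table entry that can be written as a derivative in $t$.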
D.2.1 algebraic
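\[
\begin{array}{ll}
\nabla_x\, x = \nabla_x\, x^T = I \in \mathbb{R}^{k \times k} & \nabla_X\, X = \nabla_X\, X^T \triangleq I \in \mathbb{R}^{K \times L \times K \times L} \ \text{(Identity)} \\[1ex]
\nabla_x\, \mathbf{1}^T x = \nabla_x\, x^T \mathbf{1} = \mathbf{1} \in \mathbb{R}^k & \nabla_X\, \mathbf{1}^T X \mathbf{1} = \nabla_X\, \mathbf{1}^T X^T \mathbf{1} = \mathbf{1}\mathbf{1}^T \in \mathbb{R}^{K \times L} \\[1ex]
\nabla_x (Ax - b) = A^T & \\
\nabla_x \left( x^T A - b^T \right) = A & \\[1ex]
\nabla_x (Ax - b)^T (Ax - b) = 2 A^T (Ax - b) & \\
\nabla^2_x (Ax - b)^T (Ax - b) = 2 A^T A & \\
\nabla_x \sqrt{(Ax - b)^T (Ax - b)} = A^T (Ax - b) / \|Ax - b\|_2 = \nabla_x \|Ax - b\|_2 & \\[1ex]
\nabla_x\, z^T |Ax - b| = A^T \delta(z) \operatorname{sgn}(Ax - b) , \quad z_i \neq 0 \Rightarrow (Ax - b)_i \neq 0 & \\
\nabla_x\, \mathbf{1}^T |Ax - b| = A^T \operatorname{sgn}(Ax - b) = \nabla_x \|Ax - b\|_1 & \\
\nabla_x\, \mathbf{1}^T f(|Ax - b|) = A^T \delta\!\left( \left. \frac{df(y)}{dy} \right|_{y = |Ax - b|} \right) \operatorname{sgn}(Ax - b) & \\[1ex]
\nabla_x \left( x^T A x + 2 x^T B y + y^T C y \right) = \left( A + A^T \right) x + 2 B y & \\
\nabla_x (x + y)^T A (x + y) = \left( A + A^T \right) (x + y) & \\
\nabla^2_x \left( x^T A x + 2 x^T B y + y^T C y \right) = A + A^T & \\[1ex]
 & \nabla_X\, a^T X b = \nabla_X\, b^T X^T a = a b^T \\
 & \nabla_X\, a^T X^2 b = X^T a b^T + a b^T X^T \\
 & \nabla_X\, a^T X^{-1} b = -X^{-T} a b^T X^{-T} \\
 & \nabla_X (X^{-1})_{kl} = \dfrac{\partial X^{-1}}{\partial X_{kl}} = -X^{-1} e_k e_l^T X^{-1} , \quad \text{confer (2095), (2159)} \\[1ex]
\nabla_x\, a^T x^T x b = 2 x a^T b & \nabla_X\, a^T X^T X b = X (a b^T + b a^T) \\
\nabla_x\, a^T x x^T b = (a b^T + b a^T) x & \nabla_X\, a^T X X^T b = (a b^T + b a^T) X \\
\nabla_x\, a^T x^T x a = 2 x a^T a & \nabla_X\, a^T X^T X a = 2 X a a^T \\
\nabla_x\, a^T x x^T a = 2 a a^T x & \nabla_X\, a^T X X^T a = 2 a a^T X \\
\nabla_x\, a^T y x^T b = b\, a^T y & \nabla_X\, a^T Y X^T b = b a^T Y \\
\nabla_x\, a^T y^T x b = y\, b^T a & \nabla_X\, a^T Y^T X b = Y a b^T \\
\nabla_x\, a^T x y^T b = a\, b^T y & \nabla_X\, a^T X Y^T b = a b^T Y \\
\nabla_x\, a^T x^T y b = y\, a^T b & \nabla_X\, a^T X^T Y b = Y b a^T
\end{array}
\]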

algebraic continued
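\[
\begin{array}{l}
\frac{d}{dt} (X + t\, Y) = Y \\[1ex]
\frac{d}{dt}\, B^T (X + t\, Y)^{-1} A = -B^T (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1} A \\
\frac{d}{dt}\, B^T (X + t\, Y)^{-T} A = -B^T (X + t\, Y)^{-T}\, Y^T (X + t\, Y)^{-T} A \\
\frac{d}{dt}\, B^T (X + t\, Y)^{\mu} A = \ldots , \quad -1 \leq \mu \leq 1 , \quad X , Y \in \mathbb{S}^M_+ \\[1ex]
\frac{d^2}{dt^2}\, B^T (X + t\, Y)^{-1} A = 2 B^T (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1} A \\
\frac{d^3}{dt^3}\, B^T (X + t\, Y)^{-1} A = -6 B^T (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1}\, Y (X + t\, Y)^{-1} A \\[1ex]
\frac{d}{dt} \left( (X + t\, Y)^T A (X + t\, Y) \right) = Y^T A X + X^T A Y + 2 t\, Y^T A Y \\
\frac{d^2}{dt^2} \left( (X + t\, Y)^T A (X + t\, Y) \right) = 2\, Y^T A Y \\
\frac{d}{dt} \left( (X + t\, Y)^T A (X + t\, Y) \right)^{-1} = -\left( (X + t\, Y)^T A (X + t\, Y) \right)^{-1} \left( Y^T A X + X^T A Y + 2 t\, Y^T A Y \right) \left( (X + t\, Y)^T A (X + t\, Y) \right)^{-1} \\[1ex]
\frac{d}{dt} \left( (X + t\, Y) A (X + t\, Y) \right) = Y A X + X A Y + 2 t\, Y A Y \\
\frac{d^2}{dt^2} \left( (X + t\, Y) A (X + t\, Y) \right) = 2\, Y A Y
\end{array}
\]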
D.2.2 trace Kronecker
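\[
\begin{array}{l}
\nabla_{\operatorname{vec} X} \operatorname{tr}(A X B X^T) = \nabla_{\operatorname{vec} X} \operatorname{vec}(X)^T (B^T \otimes A) \operatorname{vec} X = (B \otimes A^T + B^T \otimes A) \operatorname{vec} X \\[1ex]
\nabla^2_{\operatorname{vec} X} \operatorname{tr}(A X B X^T) = \nabla^2_{\operatorname{vec} X} \operatorname{vec}(X)^T (B^T \otimes A) \operatorname{vec} X = B \otimes A^T + B^T \otimes A \qquad (2076)
\end{array}
\]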
D.2.3 trace
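$\nabla_x\, \mu x = \mu I \qquad \nabla_X \operatorname{tr} \mu X = \nabla_X\, \mu \operatorname{tr} X = \mu I$
$\nabla_x \mathbf{1}^T \delta(x)^{-1} \mathbf{1} = \frac{d}{dx} x^{-1} = -x^{-2} \qquad \nabla_X \operatorname{tr} X^{-1} = -X^{-2T}$
$\nabla_x \mathbf{1}^T \delta(x)^{-1} y = -\delta(x)^{-2} y \qquad \nabla_X \operatorname{tr}(X^{-1} Y) = \nabla_X \operatorname{tr}(Y X^{-1}) = -X^{-T} Y^T X^{-T}$
$\frac{d}{dx} x^{\mu} = \mu x^{\mu-1} \qquad \nabla_X \operatorname{tr} X^{\mu} = \mu X^{\mu-1} , \quad X \in \mathbb{S}^M$
$\nabla_X \operatorname{tr} X^{j} = j X^{(j-1)T}$
$\nabla_x (b - a^T x)^{-1} = (b - a^T x)^{-2} a \qquad \nabla_X \operatorname{tr}\big( (B - AX)^{-1} \big) = \big( (B - AX)^{-2} A \big)^T$
$\nabla_x (b - a^T x)^{\mu} = -\mu (b - a^T x)^{\mu-1} a$
$\nabla_x x^T y = \nabla_x y^T x = y \qquad \nabla_X \operatorname{tr}(X^T Y) = \nabla_X \operatorname{tr}(Y X^T) = \nabla_X \operatorname{tr}(Y^T X) = \nabla_X \operatorname{tr}(X Y^T) = Y$
$\nabla_x x^T x = 2x \qquad \nabla_X \operatorname{tr}(X^T X) = \nabla_X \operatorname{tr}(X X^T) = 2X$
$\nabla_X \operatorname{tr}(A X B X^T) = \nabla_X \operatorname{tr}(X B X^T A) = A^T X B^T + A X B$
$\nabla_X \operatorname{tr}(A X B X) = \nabla_X \operatorname{tr}(X B X A) = A^T X^T B^T + B^T X^T A^T$
$\nabla_X \operatorname{tr}(AXAXAXAX) = \nabla_X \operatorname{tr}(XAXAXAXA) = 4 (AXAXAXA)^T$
$\nabla_X \operatorname{tr}(AXAXAX) = \nabla_X \operatorname{tr}(XAXAXA) = 3 (AXAXA)^T$
$\nabla_X \operatorname{tr}(AXAX) = \nabla_X \operatorname{tr}(XAXA) = 2 (AXA)^T$
$\nabla_X \operatorname{tr}(AX) = \nabla_X \operatorname{tr}(XA) = A^T$
$\nabla_X \operatorname{tr}(Y X^k) = \nabla_X \operatorname{tr}(X^k Y) = \sum_{i=0}^{k-1} \big( X^i\, Y X^{k-1-i} \big)^T$
$\nabla_X \operatorname{tr}(X^T Y Y^T X X^T Y Y^T X) = 4\, Y Y^T X X^T Y Y^T X$
$\nabla_X \operatorname{tr}(X Y Y^T X^T X Y Y^T X^T) = 4\, X Y Y^T X^T X Y Y^T$
$\nabla_X \operatorname{tr}(Y^T X X^T Y) = \nabla_X \operatorname{tr}(X^T Y Y^T X) = 2\, Y Y^T X$
$\nabla_X \operatorname{tr}(Y^T X^T X Y) = \nabla_X \operatorname{tr}(X Y Y^T X^T) = 2\, X Y Y^T$
$\nabla_X \operatorname{tr}\big( (X + Y)^T (X + Y) \big) = 2(X + Y) = \nabla_X \| X + Y \|_F^2$
$\nabla_X \operatorname{tr}\big( (X + Y)(X + Y) \big) = 2(X + Y)^T$
$\nabla_X \operatorname{tr}(A^T X B) = \nabla_X \operatorname{tr}(X^T A B^T) = A B^T$
$\nabla_X \operatorname{tr}(A^T X^{-1} B) = \nabla_X \operatorname{tr}(X^{-T} A B^T) = -X^{-T} A B^T X^{-T}$
$\nabla_X a^T X b = \nabla_X \operatorname{tr}(b a^T X) = \nabla_X \operatorname{tr}(X b a^T) = a b^T$
$\nabla_X b^T X^T a = \nabla_X \operatorname{tr}(X^T a b^T) = \nabla_X \operatorname{tr}(a b^T X^T) = a b^T$
$\nabla_X a^T X^{-1} b = \nabla_X \operatorname{tr}(X^{-T} a b^T) = -X^{-T} a b^T X^{-T}$
$\nabla_X a^T X^{\mu} b = \ldots$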
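Any entry in the trace table can be tested by comparing it with an entrywise finite-difference gradient. The sketch below does this for four of the entries. The helper `numerical_gradient`, the dictionary `checks`, and the random test matrices are illustrative assumptions, not part of the text.

```python
import numpy as np

def numerical_gradient(f, X, h=1e-6):
    """Entrywise central-difference approximation of the gradient of f at X."""
    G = np.zeros_like(X)
    for k in range(X.shape[0]):
        for l in range(X.shape[1]):
            E = np.zeros_like(X)
            E[k, l] = h
            G[k, l] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))
# Hypothetical test point, kept close to a multiple of I so that X is invertible.
X = n * np.eye(n) + 0.5 * rng.standard_normal((n, n))
Xinv = np.linalg.inv(X)

# Each entry pairs a function of X with the gradient claimed by the table.
checks = {
    "tr(A X B X^T)": (lambda Z: np.trace(A @ Z @ B @ Z.T),
                      A.T @ X @ B.T + A @ X @ B),
    "tr(A X B X)": (lambda Z: np.trace(A @ Z @ B @ Z),
                    A.T @ X.T @ B.T + B.T @ X.T @ A.T),
    "tr(X^-1 Y)": (lambda Z: np.trace(np.linalg.solve(Z, Y)),
                   -Xinv.T @ Y.T @ Xinv.T),
    "tr(X^3)": (lambda Z: np.trace(Z @ Z @ Z),
                3 * (X @ X).T),
}

errors = {name: np.max(np.abs(numerical_gradient(f, X) - G))
          for name, (f, G) in checks.items()}
for name, err in errors.items():
    print(name, err)
```

X is deliberately nonsymmetric here, so a missing transpose in any formula would show up as a large discrepancy.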
trace continued
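$\frac{d}{dt} \operatorname{tr} g(X + tY) = \operatorname{tr} \frac{d}{dt}\, g(X + tY)$ [273, p.491]
$\frac{d}{dt} \operatorname{tr}(X + tY) = \operatorname{tr} Y$
$\frac{d}{dt} \operatorname{tr}^{j}(X + tY) = j \operatorname{tr}^{j-1}(X + tY) \operatorname{tr} Y$
$\frac{d}{dt} \operatorname{tr}\big( (X + tY)^{j} \big) = j \operatorname{tr}\big( (X + tY)^{j-1}\, Y \big) \quad (\forall\, j)$
$\frac{d}{dt} \operatorname{tr}\big( (X + tY) Y \big) = \operatorname{tr} Y^2$
$\frac{d}{dt} \operatorname{tr}\big( (X + tY)^{k}\, Y \big) = \frac{d}{dt} \operatorname{tr}\big( Y (X + tY)^{k} \big) = k \operatorname{tr}\big( (X + tY)^{k-1}\, Y^2 \big) , \quad k \in \{0, 1, 2\}$
$\frac{d}{dt} \operatorname{tr}\big( (X + tY)^{k}\, Y \big) = \frac{d}{dt} \operatorname{tr}\big( Y (X + tY)^{k} \big) = \operatorname{tr} \sum_{i=0}^{k-1} (X + tY)^{i}\, Y (X + tY)^{k-1-i}\, Y$
$\frac{d}{dt} \operatorname{tr}\big( (X + tY)^{-1} Y \big) = -\operatorname{tr}\big( (X + tY)^{-1} Y (X + tY)^{-1} Y \big)$
$\frac{d}{dt} \operatorname{tr}\big( B^T (X + tY)^{-1} A \big) = -\operatorname{tr}\big( B^T (X + tY)^{-1} Y (X + tY)^{-1} A \big)$
$\frac{d}{dt} \operatorname{tr}\big( B^T (X + tY)^{-T} A \big) = -\operatorname{tr}\big( B^T (X + tY)^{-T} Y^T (X + tY)^{-T} A \big)$
$\frac{d}{dt} \operatorname{tr}\big( B^T (X + tY)^{-k} A \big) = \ldots , \quad k > 0$
$\frac{d}{dt} \operatorname{tr}\big( B^T (X + tY)^{\mu} A \big) = \ldots , \quad -1 \le \mu \le 1 , \quad X , Y \in \mathbb{S}^M_+$
$\frac{d^2}{dt^2} \operatorname{tr}\big( B^T (X + tY)^{-1} A \big) = 2 \operatorname{tr}\big( B^T (X + tY)^{-1} Y (X + tY)^{-1} Y (X + tY)^{-1} A \big)$
$\frac{d}{dt} \operatorname{tr}\big( (X + tY)^T A (X + tY) \big) = \operatorname{tr}\big( Y^T A X + X^T A Y + 2t\, Y^T A Y \big)$
$\frac{d^2}{dt^2} \operatorname{tr}\big( (X + tY)^T A (X + tY) \big) = 2 \operatorname{tr}\big( Y^T A Y \big)$
$\frac{d}{dt} \operatorname{tr}\Big( \big( (X + tY)^T A (X + tY) \big)^{-1} \Big) = -\operatorname{tr}\Big( \big( (X + tY)^T A (X + tY) \big)^{-1} (Y^T A X + X^T A Y + 2t\, Y^T A Y) \big( (X + tY)^T A (X + tY) \big)^{-1} \Big)$
$\frac{d}{dt} \operatorname{tr}\big( (X + tY) A (X + tY) \big) = \operatorname{tr}( YAX + XAY + 2t\, YAY )$
$\frac{d^2}{dt^2} \operatorname{tr}\big( (X + tY) A (X + tY) \big) = 2 \operatorname{tr}( YAY )$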
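The table gives two expressions for the derivative of tr((X + tY)^k Y): a short one restricted to k ∈ {0, 1, 2} and a general sum. A minimal sketch of why the restriction matters follows, using k = 3 and arbitrary random (generally noncommuting) matrices. The function names `f`, `sum_formula`, and `short_formula` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 4, 3
X = rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))
P = np.linalg.matrix_power

def f(t):
    """f(t) = tr((X + tY)^k Y)"""
    return np.trace(P(X + t * Y, k) @ Y)

def sum_formula(t):
    """General entry: tr sum_{i=0}^{k-1} Z^i Y Z^(k-1-i) Y with Z = X + tY"""
    Z = X + t * Y
    return sum(np.trace(P(Z, i) @ Y @ P(Z, k - 1 - i) @ Y) for i in range(k))

def short_formula(t):
    """Short entry k tr(Z^(k-1) Y^2), stated in the table only for k in {0, 1, 2}"""
    Z = X + t * Y
    return k * np.trace(P(Z, k - 1) @ Y @ Y)

# Central difference in t.
t0, h = 0.3, 1e-5
numeric = (f(t0 + h) - f(t0 - h)) / (2 * h)
sum_error = abs(numeric - sum_formula(t0))
short_gap = abs(numeric - short_formula(t0))
print(sum_error, short_gap)
```

The sum formula should agree with the finite difference to within roundoff. For k = 3 the short formula generally does not, because the middle term tr(ZYZY) cannot be rearranged into tr(Z²Y²) by cyclic permutation unless Z and Y commute.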
D.2.4 logarithmic determinant

x ≻ 0 , det X > 0 on some neighborhood of X , and det(X + t Y ) > 0 on some open
interval of t ; otherwise, log( ) would be discontinuous. [107, p.75]
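$\frac{d}{dx} \log x = x^{-1} \qquad \nabla_X \log\det X = X^{-T}$
$\nabla^2_X \log\det(X)_{kl} = \frac{\partial X^{-T}}{\partial X_{kl}} = -\big( X^{-1} e_k e_l^T X^{-1} \big)^T$ , confer (2112)(2159)
$\frac{d}{dx} \log x^{-1} = -x^{-1} \qquad \nabla_X \log\det X^{-1} = -X^{-T}$
$\frac{d}{dx} \log x^{\mu} = \mu x^{-1} \qquad \nabla_X \log {\det}^{\mu} X = \mu X^{-T}$
$\nabla_X \log\det X^{\mu} = \mu X^{-T}$
$\nabla_X \log\det X^{k} = \nabla_X \log {\det}^{k} X = k X^{-T}$
$\nabla_X \log {\det}^{\mu}(X + tY) = \mu (X + tY)^{-T}$
$\nabla_x \log(a^T x + b) = a\, \frac{1}{a^T x + b} \qquad \nabla_X \log\det(AX + B) = A^T (AX + B)^{-T}$
$\nabla_X \log\det(I \pm A^T X A) = \pm A (I \pm A^T X A)^{-T} A^T$
$\nabla_X \log\det(X + tY)^{k} = \nabla_X \log {\det}^{k}(X + tY) = k (X + tY)^{-T}$
$\frac{d}{dt} \log\det(X + tY) = \operatorname{tr}\big( (X + tY)^{-1} Y \big)$
$\frac{d^2}{dt^2} \log\det(X + tY) = -\operatorname{tr}\big( (X + tY)^{-1} Y (X + tY)^{-1} Y \big)$
$\frac{d}{dt} \log\det(X + tY)^{-1} = -\operatorname{tr}\big( (X + tY)^{-1} Y \big)$
$\frac{d^2}{dt^2} \log\det(X + tY)^{-1} = \operatorname{tr}\big( (X + tY)^{-1} Y (X + tY)^{-1} Y \big)$
$\frac{d}{dt} \log\det\big( \delta(A(x + ty) + a)^2 + \mu I \big) = \operatorname{tr}\Big( \big( \delta(A(x + ty) + a)^2 + \mu I \big)^{-1}\, 2\, \delta(A(x + ty) + a)\, \delta(Ay) \Big)$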
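The gradient and the first two t-derivatives of log det can be checked the same way. This sketch assumes a nonsymmetric test matrix X with det X > 0, built here as a perturbed multiple of the identity. The helper `logdet` and the step sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
# Hypothetical test data: det X > 0 and det(X + tY) > 0 for the small t used below.
X = n * np.eye(n) + 0.5 * rng.standard_normal((n, n))
Y = 0.5 * rng.standard_normal((n, n))

def logdet(Z):
    sign, logabsdet = np.linalg.slogdet(Z)
    if sign <= 0:
        raise ValueError("log det is only evaluated here where det > 0")
    return logabsdet

# Gradient with respect to X by entrywise central differences, versus X^{-T}.
h = 1e-6
grad_numeric = np.zeros((n, n))
for k in range(n):
    for l in range(n):
        E = np.zeros((n, n))
        E[k, l] = h
        grad_numeric[k, l] = (logdet(X + E) - logdet(X - E)) / (2 * h)
grad_error = np.max(np.abs(grad_numeric - np.linalg.inv(X).T))

# First and second derivatives in t at t = 0, where (X + tY)^{-1} Y = X^{-1} Y.
W = np.linalg.solve(X, Y)
s = 1e-4
phi = lambda t: logdet(X + t * Y)
d1_error = abs((phi(s) - phi(-s)) / (2 * s) - np.trace(W))
d2_error = abs((phi(s) - 2 * phi(0.0) + phi(-s)) / s**2 + np.trace(W @ W))
print(grad_error, d1_error, d2_error)
```

Using `slogdet` rather than `log(det(...))` avoids overflow or underflow in the determinant for larger matrices; the sign check enforces the positivity condition stated at the start of this subsection.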
D.2.5 determinant
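$\nabla_X \det X = \nabla_X \det X^T = \det(X)\, X^{-T}$
$\nabla_X \det X^{-1} = -\det(X^{-1})\, X^{-T} = -\det(X)^{-1} X^{-T}$
$\nabla_X {\det}^{\mu} X = \mu\, {\det}^{\mu}(X)\, X^{-T}$
$\nabla_X \det X^{\mu} = \mu \det(X^{\mu})\, X^{-T}$
$\nabla_X \det X^{k} = k\, {\det}^{k-1}(X) \big( \operatorname{tr}(X)\, I - X^T \big) , \quad X \in \mathbb{R}^{2 \times 2}$
$\nabla_X \det X^{k} = \nabla_X {\det}^{k} X = k \det(X^{k})\, X^{-T} = k\, {\det}^{k}(X)\, X^{-T}$
$\nabla_X {\det}^{\mu}(X + tY) = \mu\, {\det}^{\mu}(X + tY)\, (X + tY)^{-T}$
$\nabla_X \det(X + tY)^{k} = \nabla_X {\det}^{k}(X + tY) = k\, {\det}^{k}(X + tY)\, (X + tY)^{-T}$
$\frac{d}{dt} \det(X + tY) = \det(X + tY) \operatorname{tr}\big( (X + tY)^{-1} Y \big)$
$\frac{d^2}{dt^2} \det(X + tY) = \det(X + tY) \Big( \operatorname{tr}^2\big( (X + tY)^{-1} Y \big) - \operatorname{tr}\big( (X + tY)^{-1} Y (X + tY)^{-1} Y \big) \Big)$
$\frac{d}{dt} \det(X + tY)^{-1} = -\det(X + tY)^{-1} \operatorname{tr}\big( (X + tY)^{-1} Y \big)$
$\frac{d^2}{dt^2} \det(X + tY)^{-1} = \det(X + tY)^{-1} \Big( \operatorname{tr}^2\big( (X + tY)^{-1} Y \big) + \operatorname{tr}\big( (X + tY)^{-1} Y (X + tY)^{-1} Y \big) \Big)$
$\frac{d}{dt} {\det}^{\mu}(X + tY) = \mu\, {\det}^{\mu}(X + tY) \operatorname{tr}\big( (X + tY)^{-1} Y \big)$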
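The 2×2 entry is worth a separate look because it needs no inverse, so it can be evaluated at a singular matrix. The sketch below checks the general gradient det(X) X^{-T} on a random 3×3 matrix and the 2×2 formula at both a nonsingular and a singular point. All names and test matrices are illustrative assumptions.

```python
import numpy as np

def numerical_gradient(f, X, h=1e-6):
    """Entrywise central-difference approximation of the gradient of f at X."""
    G = np.zeros_like(X)
    for k in range(X.shape[0]):
        for l in range(X.shape[1]):
            E = np.zeros_like(X)
            E[k, l] = h
            G[k, l] = (f(X + E) - f(X - E)) / (2 * h)
    return G

def grad_det_power_2x2(X, k):
    """Table entry for X in R^{2x2}: k det^(k-1)(X) (tr(X) I - X^T)"""
    return k * np.linalg.det(X) ** (k - 1) * (np.trace(X) * np.eye(2) - X.T)

# General formula det(X) X^{-T} on a hypothetical invertible 3x3 matrix.
rng = np.random.default_rng(5)
X3 = 3 * np.eye(3) + 0.5 * rng.standard_normal((3, 3))
general_error = np.max(np.abs(
    numerical_gradient(np.linalg.det, X3) - np.linalg.det(X3) * np.linalg.inv(X3).T))

# 2x2 formula with k = 3 at a nonsingular point.
T = np.array([[2.0, 1.0], [0.5, 3.0]])
det_cubed = lambda Z: np.linalg.det(np.linalg.matrix_power(Z, 3))
power_error = np.max(np.abs(numerical_gradient(det_cubed, T) - grad_det_power_2x2(T, 3)))

# 2x2 formula with k = 1 at a singular point, where X^{-T} does not exist.
S = np.array([[1.0, 2.0], [2.0, 4.0]])
singular_error = np.max(np.abs(
    numerical_gradient(np.linalg.det, S) - grad_det_power_2x2(S, 1)))
print(general_error, power_error, singular_error)
```

For a 2×2 matrix, tr(X) I − X is the adjugate of X, which explains why the formula agrees with det(X) X^{-T} wherever the inverse exists.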
D.2.6 logarithmic
Matrix logarithm.
$\frac{d}{dt} \log(X + tY)^{\mu} = \mu\, Y (X + tY)^{-1} = \mu\, (X + tY)^{-1} Y , \quad XY = YX$
$\frac{d}{dt} \log(I - tY)^{\mu} = -\mu\, Y (I - tY)^{-1} = -\mu\, (I - tY)^{-1} Y$ [273, p.493]
D.2.7 exponential
Matrix exponential. [98, §3.6, §4.5] [440, §5.4]
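$\nabla_X e^{\operatorname{tr}(Y^T X)} = \nabla_X \det e^{Y^T X} = e^{\operatorname{tr}(Y^T X)}\, Y \quad (\forall\, X , Y)$
$\nabla_X \operatorname{tr} e^{YX} = e^{Y^T X^T} Y^T = Y^T e^{X^T Y^T} \quad (\forall\, X , Y)$
$\nabla_X \operatorname{tr}\big( A e^{YX} \big) = \ldots$
$\nabla_x \mathbf{1}^T e^{Ax} = A^T e^{Ax}$
$\nabla_x \mathbf{1}^T e^{|Ax|} = A^T \delta(\operatorname{sgn}(Ax))\, e^{|Ax|} , \quad (Ax)_i \neq 0$
$\nabla_x \log(\mathbf{1}^T e^{x}) = \frac{1}{\mathbf{1}^T e^{x}}\, e^{x}$
$\nabla^2_x \log(\mathbf{1}^T e^{x}) = \frac{1}{\mathbf{1}^T e^{x}} \left( \delta(e^{x}) - \frac{1}{\mathbf{1}^T e^{x}}\, e^{x} e^{x^T} \right)$
$\nabla_x \prod_{i=1}^{k} x_i^{1/k} = \frac{1}{k} \left( \prod_{i=1}^{k} x_i^{1/k} \right) 1/x$
$\nabla^2_x \prod_{i=1}^{k} x_i^{1/k} = -\frac{1}{k} \left( \prod_{i=1}^{k} x_i^{1/k} \right) \left( \delta(x)^{-2} - \frac{1}{k} (1/x)(1/x)^T \right)$
$\frac{d}{dt} e^{tY} = e^{tY} Y = Y e^{tY}$
$\frac{d}{dt} e^{X + tY} = e^{X + tY}\, Y = Y e^{X + tY} , \quad XY = YX$
$\frac{d^2}{dt^2} e^{X + tY} = e^{X + tY}\, Y^2 = Y e^{X + tY}\, Y = Y^2 e^{X + tY} , \quad XY = YX$
$\frac{d^j}{dt^j} e^{\operatorname{tr}(X + tY)} = e^{\operatorname{tr}(X + tY)} \operatorname{tr}^{j}(Y)$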
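The gradient and Hessian of log(1^T e^x) can be written compactly with p = e^x / (1^T e^x), giving ∇ = p and ∇² = δ(p) − p p^T. The sketch below implements them and checks both against finite differences. Subtracting max(x) before exponentiating is a numerical-stability device added here, not something stated in the table, and the function names are illustrative.

```python
import numpy as np

def log_sum_exp(x):
    """log(1^T e^x), shifted by max(x) to avoid overflow (added assumption)."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def lse_gradient(x):
    """Table entry: e^x / (1^T e^x); the shift cancels in the ratio."""
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def lse_hessian(x):
    """Table entry rewritten with p = e^x / (1^T e^x): diag(p) - p p^T"""
    p = lse_gradient(x)
    return np.diag(p) - np.outer(p, p)

x = np.array([0.5, -1.0, 2.0, 0.0])
h = 1e-5
I = np.eye(x.size)

# Central differences of the function give the gradient.
grad_numeric = np.array([(log_sum_exp(x + h * I[i]) - log_sum_exp(x - h * I[i])) / (2 * h)
                         for i in range(x.size)])
# Central differences of the gradient give the Hessian (row i holds d(grad)/dx_i).
hess_numeric = np.array([(lse_gradient(x + h * I[i]) - lse_gradient(x - h * I[i])) / (2 * h)
                         for i in range(x.size)])

grad_error = np.max(np.abs(grad_numeric - lse_gradient(x)))
hess_error = np.max(np.abs(hess_numeric - lse_hessian(x)))
min_eigenvalue = np.min(np.linalg.eigvalsh(lse_hessian(x)))
print(grad_error, hess_error, min_eigenvalue)
```

The Hessian annihilates the vector of ones and its smallest eigenvalue is zero up to roundoff, consistent with log(1^T e^x) being convex but not strictly convex along the direction 1.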
D.2.7.0.1 Exercise. Expand these tables.

Provide four unfinished table entries indicated by … in §D.2.1 & §D.2.3.
D.2.7.0.2 Exercise. log . (§D.1.7, §3.5.4)

Find the first four terms of the Taylor series expansion for log x about x = 1. Plot the supporting hyperplane to the hypograph of log x at $\begin{bmatrix} x \\ \log x \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$. Prove log x ≤ x − 1 .
