Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms. We therefore precede our introduction to deep learning with a focused presentation of the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If you have previous experience with these concepts but need a detailed reference sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006). If you have no exposure at all to linear algebra, this chapter will teach you enough to read this book, but we highly recommend that you also consult another resource focused exclusively on teaching linear algebra, such as Shilov (1977). This chapter will completely omit many important linear algebra topics that are not essential for understanding deep learning.
CHAPTER 2. LINEAR ALGEBRA

• Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lowercase variable names. When we introduce them, we specify what kind of number they are. For example, we might say "Let s ∈ ℝ be the slope of the line," while defining a real-valued scalar, or "Let n ∈ ℕ be the number of units," while defining a natural number scalar.
We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.

Sometimes we need to index a set of elements of a vector. In this case, we define a set containing the indices and write the set as a subscript. For example, to access x_1, x_3 and x_6, we define the set S = {1, 3, 6} and write x_S. We use the − sign to index the complement of a set. For example, x_{−1} is the vector containing all elements of x except for x_1, and x_{−S} is the vector containing all of the elements of x except for x_1, x_3 and x_6.
• Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one. We usually give matrices uppercase variable names with bold typeface, such as A. If a real-valued matrix A has a height of m and a width of n, then we say that A ∈ ℝ^{m×n}. We usually identify the elements of a matrix using its name in italic but not bold font, and the indices are listed with separating commas. For example, A_{1,1} is the upper left entry of A and A_{m,n} is the bottom right entry. We can identify all of the numbers with vertical coordinate i by writing a ":" for the horizontal coordinate. For example, A_{i,:} denotes the horizontal cross section of A with vertical coordinate i. This is known as the i-th row of A. Likewise, A_{:,i} is
Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.
the i-th column of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:
\[
\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}. \tag{2.2}
\]
• Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named "A" with this typeface: A. We identify the element of A at coordinates (i, j, k) by writing A_{i,j,k}.
3 a 1 h F E k 2. LbN E 1 k 1 L7 E M k 1
define a vector by writing out its elements in the text inline as a row matrix, then using the transpose operator to turn it into a standard column vector, e.g., x = [x_1, x_2, x_3]^⊤.

A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own transpose: a = a^⊤.
We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B where C_{i,j} = A_{i,j} + B_{i,j}.

We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of a matrix: D = a·B + c where D_{i,j} = a·B_{i,j} + c.
In the context of deep learning, we also use some less conventional notation. We allow the addition of a matrix and a vector, yielding another matrix: C = A + b, where C_{i,j} = A_{i,j} + b_j. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each row before doing the addition. This implicit copying of b to many locations is called broadcasting.
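This convention maps directly onto the broadcasting rules of array libraries. As a quick NumPy sketch (the library choice and the example values are ours, not the text's):

```python
import numpy as np

# A 3x2 matrix plus a length-2 vector: b is implicitly copied ("broadcast")
# into each row, so C[i, j] = A[i, j] + b[j].
A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
b = np.array([10., 20.])

C = A + b
```

No explicit copy of b is ever materialized; the library performs the implicit replication described above.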
Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise product or Hadamard product, and is denoted as A ⊙ B.

The dot product between two vectors x and y of the same dimensionality is the matrix product x^⊤y. We can think of the matrix product C = AB as computing C_{i,j} as the dot product between row i of A and column j of B.
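The distinction between the two products is easy to check numerically; a small NumPy sketch (not from the text):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

hadamard = A * B   # element-wise (Hadamard) product A ⊙ B
product = A @ B    # standard matrix product AB

x = np.array([1., 2.])
y = np.array([3., 4.])
dot = x @ y        # dot product x^T y, a scalar
```

Note that `product[i, j]` is exactly the dot product of row i of A with column j of B, matching the definition above.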
\[
(AB)^\top = B^\top A^\top. \tag{2.9}
\]
This allows us to demonstrate equation 2.8, by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose:
\[
x^\top y = \left( x^\top y \right)^\top = y^\top x. \tag{2.10}
\]
Since the focus of this textbook is not linear algebra, we do not attempt to develop a comprehensive list of useful properties of the matrix product here, but the reader should be aware that many more exist.
We now know enough linear algebra notation to write down a system of linear equations:
\[
Ax = b \tag{2.11}
\]
where A ∈ ℝ^{m×n} is a known matrix, b ∈ ℝ^m is a known vector, and x ∈ ℝ^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one of these unknown variables. Each row of A and each element of b provide another constraint. We can rewrite equation 2.11 as:
\[
A_{1,:} x = b_1 \tag{2.12}
\]
\[
A_{2,:} x = b_2 \tag{2.13}
\]
\[
\cdots \tag{2.14}
\]
\[
A_{m,:} x = b_m \tag{2.15}
\]
or, even more explicitly, as:
\[
A_{1,1} x_1 + A_{1,2} x_2 + \cdots + A_{1,n} x_n = b_1 \tag{2.16}
\]
\[
A_{2,1} x_1 + A_{2,2} x_2 + \cdots + A_{2,n} x_n = b_2 \tag{2.17}
\]
\[
\cdots \tag{2.18}
\]
\[
A_{m,1} x_1 + A_{m,2} x_2 + \cdots + A_{m,n} x_n = b_m. \tag{2.19}
\]
\[
I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
Figure 2.2: Example identity matrix: This is I_3.

\[
\forall x \in \mathbb{R}^n,\; I_n x = x. \tag{2.20}
\]
The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See figure 2.2 for an example.
The matrix inverse of A is denoted as A^{−1}, and it is defined as the matrix such that
\[
A^{-1} A = I_n. \tag{2.21}
\]
We can now solve equation 2.11 using the following steps:
\[
Ax = b \tag{2.22}
\]
\[
A^{-1} A x = A^{-1} b \tag{2.23}
\]
\[
I_n x = A^{-1} b \tag{2.24}
\]
\[
x = A^{-1} b. \tag{2.25}
\]
Of course, this process depends on it being possible to find A^{−1}. We discuss the conditions for the existence of A^{−1} in the following section.
When A^{−1} exists, several different algorithms exist for finding it in closed form. In theory, the same inverse matrix can then be used to solve the equation many times for different values of b. However, A^{−1} is primarily useful as a theoretical tool, and should not actually be used in practice for most software applications. Because A^{−1} can be represented with only limited precision on a digital computer, algorithms that make use of the value of b can usually obtain more accurate estimates of x.
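This distinction is visible in NumPy's linear algebra routines; a sketch with made-up numbers (not from the text):

```python
import numpy as np

# The system: 3x + y = 9, x + 2y = 8, with exact solution x = (2, 3).
A = np.array([[3., 1.],
              [1., 2.]])
b = np.array([9., 8.])

x_theory = np.linalg.inv(A) @ b    # the theoretical route: x = A^{-1} b
x_stable = np.linalg.solve(A, b)   # factorization-based; preferred in practice
```

For well-conditioned 2x2 systems both routes agree, but `solve` avoids forming A^{−1} explicitly and is the numerically recommended path.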
\[
z = \alpha x + (1 - \alpha) y \tag{2.26}
\]
solution.

So far we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right:
\[
A A^{-1} = I. \tag{2.29}
\]
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the L^p norm is given by
\[
\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p} \tag{2.30}
\]
for p ∈ ℝ, p ≥ 1.
Norms, including the L^p norm, are functions mapping vectors to nonnegative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:

• f(x) = 0 ⇒ x = 0
• f(x + y) ≤ f(x) + f(y) (the triangle inequality)
• ∀α ∈ ℝ, f(αx) = |α| f(x)
The L² norm, with p = 2, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by x. The L² norm is used so frequently in machine learning that it is often denoted simply as ‖x‖, with the subscript 2 omitted. It is also common to measure the size of a vector using the squared L² norm, which can be calculated simply as x^⊤x.

The squared L² norm is more convenient to work with mathematically and computationally than the L² norm itself. For example, the derivatives of the squared L² norm with respect to each element of x each depend only on the corresponding element of x, while all of the derivatives of the L² norm depend on the entire vector. In many contexts, the squared L² norm may be undesirable because it increases very slowly near the origin. In several machine learning
applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the L¹ norm. The L¹ norm may be simplified to
\[
\|x\|_1 = \sum_i |x_i|. \tag{2.31}
\]
The L¹ norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of x moves away from 0 by ε, the L¹ norm increases by ε.
We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "L⁰ norm," but this is incorrect terminology. The number of nonzero entries in a vector is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L¹ norm is often used as a substitute for the number of nonzero entries.
One other norm that commonly arises in machine learning is the L^∞ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,
\[
\|x\|_\infty = \max_i |x_i|. \tag{2.32}
\]
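The norms discussed so far are all one-liners in NumPy; a quick sketch (our example vector, not the text's):

```python
import numpy as np

x = np.array([3., -4., 0.])

l1 = np.linalg.norm(x, ord=1)         # sum of absolute values
l2 = np.linalg.norm(x)                # Euclidean norm (the default, ord=2)
sq_l2 = x @ x                         # squared L^2 norm as x^T x
linf = np.linalg.norm(x, ord=np.inf)  # max norm
nonzeros = np.count_nonzero(x)        # the misnamed "L^0": not a true norm
```

Note that scaling x leaves `nonzeros` unchanged, which is exactly why the count of nonzero entries fails the homogeneity property of a norm.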
Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm:
\[
\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}, \tag{2.33}
\]
which is analogous to the L² norm of a vector.
all i ≠ j. We have already seen one example of a diagonal matrix: the identity matrix, where all of the diagonal entries are 1. We write diag(v) to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector v. Diagonal matrices are of interest in part because multiplying by a diagonal matrix is very computationally efficient. To compute diag(v)x, we only need to scale each element x_i by v_i. In other words, diag(v)x = v ⊙ x. Inverting a square diagonal matrix is also efficient. The inverse exists only if every diagonal entry is nonzero, and in that case, diag(v)^{−1} = diag([1/v_1, ..., 1/v_n]^⊤). In many cases, we may derive some very general machine learning algorithm in terms of arbitrary matrices, but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal.
Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Nonsquare diagonal matrices do not have inverses, but it is still possible to multiply by them cheaply. For a nonsquare diagonal matrix D, the product Dx will involve scaling each element of x, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.
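Both shortcuts for square diagonal matrices are easy to verify numerically; a small NumPy sketch (the values are ours):

```python
import numpy as np

v = np.array([2., 3., 5.])
x = np.array([1., 1., 2.])

# Multiplying by diag(v) is just element-wise scaling: diag(v) x = v ⊙ x.
dense = np.diag(v) @ x
cheap = v * x

# Inverting a square diagonal matrix only requires reciprocals of the
# diagonal entries (all of which must be nonzero).
inv_dense = np.linalg.inv(np.diag(v))
inv_cheap = np.diag(1.0 / v)
```

The cheap forms cost O(n) rather than the O(n²) or O(n³) of the dense operations, which is the efficiency gain the text refers to.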
A symmetric matrix is any matrix that is equal to its own transpose:
\[
A = A^\top. \tag{2.35}
\]
Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if A is a matrix of distance measurements, with A_{i,j} giving the distance from point i to point j, then A_{i,j} = A_{j,i} because distance functions are symmetric.
A unit vector is a vector with unit norm:
\[
\|x\|_2 = 1. \tag{2.36}
\]
An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:
\[
A^\top A = A A^\top = I, \tag{2.37}
\]
which implies that A^{−1} = A^⊤.
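A rotation matrix is a convenient concrete example of an orthogonal matrix; a quick NumPy check (our example, not the text's):

```python
import numpy as np

theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a 2-D rotation matrix

# Rows and columns are orthonormal, so Q^T Q = Q Q^T = I ...
orthogonal = np.allclose(Q.T @ Q, np.eye(2)) and np.allclose(Q @ Q.T, np.eye(2))

# ... which makes inversion trivially cheap: Q^{-1} = Q^T.
inverse_is_transpose = np.allclose(np.linalg.inv(Q), Q.T)
```

This cheap inverse is one reason orthogonal matrices are so pleasant to work with numerically.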
Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.

For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that 12 = 2 × 2 × 3. From this representation we can conclude useful properties, such as that 12 is not divisible by 5, or that any integer multiple of 12 will be divisible by 3.

Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.
One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

An eigenvector of a square matrix A is a nonzero vector v such that multiplication by A alters only the scale of v:
\[
A v = \lambda v. \tag{2.39}
\]
The scalar λ is known as the eigenvalue corresponding to this eigenvector.
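The defining property Av = λv can be checked directly against the output of a numerical eigensolver; a NumPy sketch (our example matrix):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors

# Multiplication by A only rescales each eigenvector: A v = λ v.
for i in range(len(eigvals)):
    v, lam = eigvecs[:, i], eigvals[i]
    assert np.allclose(A @ v, lam * v)
```

For this matrix the eigenvalues are 3 and 1, with eigenvectors along (1, 1) and (1, −1).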
Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v^{(1)} with eigenvalue λ_1 and v^{(2)} with eigenvalue λ_2. (Left) We plot the set of all unit vectors u ∈ ℝ² as a unit circle. (Right) We plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v^{(i)} by λ_i.
eigenvectors to form a matrix V with one eigenvector per column: V = [v^{(1)}, ..., v^{(n)}]. Likewise, we can concatenate the eigenvalues to form a vector λ = [λ_1, ..., λ_n]^⊤. The eigendecomposition of A is then given by
\[
A = V \operatorname{diag}(\lambda) V^{-1}. \tag{2.40}
\]
We have seen that constructing matrices with specific eigenvalues and eigenvectors allows us to stretch space in desired directions. However, we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.

Not every matrix can be decomposed into eigenvalues and eigenvectors. In some
cases, the decomposition exists but may involve complex rather than real numbers. Fortunately, in this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:
\[
A = Q \Lambda Q^\top, \tag{2.41}
\]
where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix. The eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q, denoted as Q_{:,i}. Because Q is an orthogonal matrix, we can think of A as scaling space by λ_i in direction v^{(i)}. See figure 2.3 for an example.
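Numerical libraries expose this real, orthogonal decomposition through solvers specialized for symmetric matrices; a NumPy sketch (our example matrix):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])          # real and symmetric

lam, Q = np.linalg.eigh(A)        # eigh is specialized for symmetric input

# A = Q diag(λ) Q^T, with Q orthogonal and everything real-valued.
reconstruction = Q @ np.diag(lam) @ Q.T
```

Unlike the general `eig` routine, `eigh` guarantees real eigenvalues (returned in ascending order) and an orthogonal Q.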
While any real symmetric matrix A is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue, and we could equivalently choose a Q using those eigenvectors instead. By convention, we usually sort the entries of Λ in descending order. Under this convention, the eigendecomposition is unique only if all of the eigenvalues are unique.
The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is singular if and only if any of the eigenvalues are zero. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form f(x) = x^⊤Ax subject to ‖x‖₂ = 1. Whenever x is equal to an eigenvector of A, f takes on the value of the corresponding eigenvalue. The maximum value of f within the constraint region is the maximum eigenvalue and its minimum value within the constraint region is the minimum eigenvalue.
A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero valued, it is negative semidefinite. Positive semidefinite matrices are interesting because they guarantee that ∀x, x^⊤Ax ≥ 0. Positive definite matrices additionally guarantee that x^⊤Ax = 0 ⇒ x = 0.
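The eigenvalue test and the quadratic-form guarantee can both be spot-checked numerically; a NumPy sketch (our example matrix and random probes, not the text's):

```python
import numpy as np

A = np.array([[2., -1.],
              [-1., 2.]])                  # symmetric, with eigenvalues 1 and 3

# Positive definite: all eigenvalues strictly positive.
positive_definite = bool(np.all(np.linalg.eigvalsh(A) > 0))

# Spot-check the guarantee x^T A x > 0 on random nonzero vectors.
rng = np.random.default_rng(0)
quad_forms = [x @ A @ x for x in rng.standard_normal((100, 2))]
```

Since x^⊤Ax ≥ λ_min ‖x‖² and λ_min = 1 here, every probe necessarily comes out positive.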
In section 2.7, we saw how to decompose a matrix into eigenvectors and eigenvalues. The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition. However,
the SVD is more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.
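The greater generality is easy to see in code: a nonsquare matrix has no eigendecomposition, but its SVD always exists. A NumPy sketch (our example):

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)   # nonsquare: no eigendecomposition

U, s, Vt = np.linalg.svd(A)   # A = U D V^T, with the singular values in s

# Rebuild the rectangular diagonal matrix D and reconstruct A.
D = np.zeros((2, 3))
D[:2, :2] = np.diag(s)
reconstruction = U @ D @ Vt
```

Here U is 2×2, V^⊤ is 3×3, and D is a rectangular diagonal matrix, matching the shapes described in the text.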
Recall that the eigendecomposition involves analyzing a matrix A to discover a matrix V of eigenvectors and a vector of eigenvalues λ such that we can rewrite A as
\[
A = V \operatorname{diag}(\lambda) V^{-1}. \tag{2.42}
\]
Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse B of a matrix A, so that we can solve a linear equation
\[
A x = y \tag{2.44}
\]
\[
x = B y. \tag{2.45}
\]
Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula
\[
A^+ = V D^+ U^\top, \tag{2.47}
\]
where U, D and V are the singular value decomposition of A, and the pseudoinverse D^+ of a diagonal matrix D is obtained by taking the reciprocal of its nonzero elements then taking the transpose of the resulting matrix.
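NumPy's `pinv` computes the Moore–Penrose pseudoinverse via the SVD; a sketch on an overdetermined system (our example values):

```python
import numpy as np

A = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])              # tall: more equations than unknowns
y = np.array([1., 1., 3.])

x = np.linalg.pinv(A) @ y             # least-squares solution via A^+
```

Since no exact solution exists here, x = A^+ y returns the least-squares answer: the residual Ax − y is orthogonal to the column space of A.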
matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:
\[
\|A\|_F = \sqrt{\operatorname{Tr}\left( A A^\top \right)}. \tag{2.49}
\]
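This identity is a one-line numerical check; a NumPy sketch (our example matrix):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])

fro = np.linalg.norm(A, 'fro')               # Frobenius norm directly
via_trace = np.sqrt(np.trace(A @ A.T))       # the same norm via Tr(A A^T)
```

Both routes give √(1 + 4 + 9 + 16) = √30 for this matrix.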
The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:
\[
\operatorname{Tr}(ABC) = \operatorname{Tr}(CAB) = \operatorname{Tr}(BCA), \tag{2.51}
\]
or more generally,
\[
\operatorname{Tr}\left( \prod_{i=1}^{n} F^{(i)} \right) = \operatorname{Tr}\left( F^{(n)} \prod_{i=1}^{n-1} F^{(i)} \right). \tag{2.52}
\]
This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ ℝ^{m×n} and B ∈ ℝ^{n×m}, we have
\[
\operatorname{Tr}(AB) = \operatorname{Tr}(BA), \tag{2.53}
\]
even though AB ∈ ℝ^{m×m} and BA ∈ ℝ^{n×n}.
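For instance, the traces of AB and BA agree even when the two products have different shapes; a quick NumPy check (our example matrices):

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(6, dtype=float).reshape(3, 2)

tr_ab = np.trace(A @ B)   # trace of a 2x2 product
tr_ba = np.trace(B @ A)   # trace of a 3x3 product
```

Only the traces agree; the products themselves live in different spaces (2×2 versus 3×3).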
\[
c^* = \arg\min_c\, -2 x^\top D c + c^\top I_l c \tag{2.60}
\]
\[
= \arg\min_c\, -2 x^\top D c + c^\top c. \tag{2.61}
\]
We can solve this optimization problem using vector calculus (see section 4.3 if you do not know how to do this):
\[
\nabla_c \left( -2 x^\top D c + c^\top c \right) = 0 \tag{2.62}
\]
\[
-2 D^\top x + 2 c = 0 \tag{2.63}
\]
\[
c = D^\top x. \tag{2.64}
\]
This makes the algorithm efficient: we can optimally encode x just using a matrix-vector operation. To encode a vector, we apply the encoder function
\[
f(x) = D^\top x. \tag{2.65}
\]
Next, we need to choose the encoding matrix D. To do so, we revisit the idea of minimizing the L² distance between inputs and reconstructions. Since we will use the same matrix D to decode all of the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:
\[
D^* = \arg\min_D \sqrt{ \sum_{i,j} \left( x_j^{(i)} - r\left( x^{(i)} \right)_j \right)^2 } \quad \text{subject to } D^\top D = I_l. \tag{2.68}
\]
The above formulation is the most direct way of performing the substitution, but is not the most stylistically pleasing way to write the equation. It places the scalar value d^⊤x^{(i)} on the right of the vector d. It is more conventional to write scalar coefficients on the left of the vector they operate on. We therefore usually write such a formula as
\[
d^* = \arg\min_d \sum_i \left\| x^{(i)} - d^\top x^{(i)} d \right\|_2^2 \quad \text{subject to } \|d\|_2 = 1. \tag{2.70}
\]
The reader should aim to become familiar with such cosmetic rearrangements.
At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. This will allow us to use more compact notation. Let X ∈ ℝ^{m×n} be the matrix defined by stacking all of the vectors describing the points, such that X_{i,:} = x^{(i)⊤}. We can now rewrite the problem as
\[
d^* = \arg\min_d \left\| X - X d d^\top \right\|_F^2 \quad \text{subject to } d^\top d = 1. \tag{2.71}
\]
Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:
\[
\arg\min_d \left\| X - X d d^\top \right\|_F^2 \tag{2.73}
\]
\[
= \arg\min_d \operatorname{Tr}\left( \left( X - X d d^\top \right)^\top \left( X - X d d^\top \right) \right) \tag{2.74}
\]
(by equation 2.49)
\[
= \arg\min_d \operatorname{Tr}\left( X^\top X - X^\top X d d^\top - d d^\top X^\top X + d d^\top X^\top X d d^\top \right) \tag{2.75}
\]
\[
= \arg\min_d \operatorname{Tr}\left( X^\top X \right) - \operatorname{Tr}\left( X^\top X d d^\top \right) - \operatorname{Tr}\left( d d^\top X^\top X \right) + \operatorname{Tr}\left( d d^\top X^\top X d d^\top \right) \tag{2.76}
\]
\[
= \arg\min_d - \operatorname{Tr}\left( X^\top X d d^\top \right) - \operatorname{Tr}\left( d d^\top X^\top X \right) + \operatorname{Tr}\left( d d^\top X^\top X d d^\top \right) \tag{2.77}
\]
(because terms not involving d do not affect the arg min)
\[
= \arg\min_d -2 \operatorname{Tr}\left( X^\top X d d^\top \right) + \operatorname{Tr}\left( d d^\top X^\top X d d^\top \right) \tag{2.78}
\]
(because we can cycle the order of the matrices inside a trace, equation 2.52)
\[
= \arg\min_d -2 \operatorname{Tr}\left( X^\top X d d^\top \right) + \operatorname{Tr}\left( X^\top X d d^\top d d^\top \right) \tag{2.79}
\]
\[
= \arg\min_d -2 \operatorname{Tr}\left( X^\top X d d^\top \right) + \operatorname{Tr}\left( X^\top X d d^\top \right) \quad \text{subject to } d^\top d = 1 \tag{2.81}
\]
(due to the constraint)
\[
= \arg\min_d - \operatorname{Tr}\left( X^\top X d d^\top \right) \quad \text{subject to } d^\top d = 1 \tag{2.82}
\]
\[
= \arg\max_d \operatorname{Tr}\left( X^\top X d d^\top \right) \quad \text{subject to } d^\top d = 1 \tag{2.83}
\]
\[
= \arg\max_d \operatorname{Tr}\left( d^\top X^\top X d \right) \quad \text{subject to } d^\top d = 1. \tag{2.84}
\]
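The quadratic form in the final line is maximized, over unit vectors, by the eigenvector of X^⊤X with the largest eigenvalue. A NumPy sketch of the resulting one-component encoder and decoder (the synthetic data and variable names are ours, not the text's):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) * np.array([5., 1., 0.1])  # anisotropic cloud

# The optimal d is the top eigenvector of X^T X.
lam, Q = np.linalg.eigh(X.T @ X)
d = Q[:, -1]                      # eigh sorts eigenvalues in ascending order

codes = X @ d                     # encode every point: c^(i) = d^T x^(i)
recons = np.outer(codes, d)       # decode: x^(i) is approximated by c^(i) d

err_d = np.linalg.norm(X - recons)
```

Because d solves the optimization above, its reconstruction error is no worse than that of any other unit direction, e.g. any coordinate axis.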