http://www.deeplearningbook.org/contents/linear_algebra.html

CHAPTER 2. LINEAR ALGEBRA

Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms. We therefore precede our introduction to deep learning with a focused presentation of the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If you have previous experience with these concepts but need a detailed reference sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006). If you have no exposure at all to linear algebra, this chapter will teach you enough to read this book, but we highly recommend that you also consult another resource focused exclusively on teaching linear algebra, such as Shilov (1977). This chapter will completely omit many important linear algebra topics that are not essential for understanding deep learning.

2.1 Scalars, Vectors, Matrices and Tensors

The study of linear algebra involves several types of mathematical objects:

• Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lower-case variable names. When we introduce them, we specify what kind of number they are. For example, we might say "Let s ∈ ℝ be the slope of the line," while defining a real-valued scalar, or "Let n ∈ ℕ be the number of units," while defining a natural number scalar.

• Vectors: A vector is an array of numbers. The numbers are arranged in order. We can identify each individual number by its index in that ordering. Typically we give vectors lower case names written in bold typeface, such as a. The elements of the vector are identified by writing its name in italic typeface, with a subscript. The first element of a is a_1, the second element is a_2, and so on. We also need to say what kind of numbers are stored in the vector. If each element is in ℝ, and the vector has n elements, then the vector lies in the set formed by taking the Cartesian product of ℝ n times, denoted as ℝ^n. When we need to explicitly identify the elements of a vector, we write them as a column enclosed in square brackets:

      ⎡ a_1 ⎤
  a = ⎢ a_2 ⎥   (2.1)
      ⎢  ⋮  ⎥
      ⎣ a_n ⎦

We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.

Sometimes we need to index a set of elements of a vector. In this case, we define a set containing the indices and write the set as a subscript. For example, to access a_1, a_3 and a_6, we define the set S = {1, 3, 6} and write a_S. We use the − sign to index the complement of a set. For example a_{−1} is the vector containing all elements of a except for a_1, and a_{−S} is the vector containing all of the elements of a except for a_1, a_3 and a_6.

• Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one. We usually give matrices upper-case variable names with bold typeface, such as A. If a real-valued matrix A has a height of m and a width of n, then we say that A ∈ ℝ^{m×n}. We usually identify the elements of a matrix using its name in italic but not bold font, and the indices are listed with separating commas. For example, A_{1,1} is the upper left entry of A and A_{m,n} is the bottom right entry. We can identify all of the numbers with vertical coordinate i by writing a ":" for the horizontal coordinate. For example, A_{i,:} denotes the horizontal cross section of A with vertical coordinate i. This is known as the i-th row of A. Likewise, A_{:,i} is the i-th column of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

  ⎡ A_{1,1}  A_{1,2} ⎤   (2.2)
  ⎣ A_{2,1}  A_{2,2} ⎦

Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we use subscripts after the expression, but do not convert anything to lower case. For example, f(A)_{i,j} gives element (i, j) of the matrix computed by applying the function f to A.

      ⎡ A_{1,1}  A_{1,2} ⎤
  A = ⎢ A_{2,1}  A_{2,2} ⎥   ⟹   A^⊤ = ⎡ A_{1,1}  A_{2,1}  A_{3,1} ⎤
      ⎣ A_{3,1}  A_{3,2} ⎦               ⎣ A_{1,2}  A_{2,2}  A_{3,2} ⎦

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.

• Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named "A" with a sans-serif typeface: A. We identify the element of A at coordinates (i, j, k) by writing A_{i,j,k}.
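The indexing conventions above map directly onto array libraries. As a minimal sketch, here is how they look in NumPy (our choice of library here, not something the text prescribes); note that NumPy indices are 0-based, so the text's a_1, a_3, a_6 become indices 0, 2, 5:

```python
import numpy as np

a = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# a_S for S = {1, 3, 6} in 1-based notation: indices 0, 2, 5 here.
S = [0, 2, 5]
print(a[S])                       # elements a_1, a_3, a_6

# Complement indexing a_{-S}: all elements except those in S.
mask = np.ones(a.shape, dtype=bool)
mask[S] = False
print(a[mask])                    # a_2, a_4, a_5

A = np.arange(6.0).reshape(2, 3)  # a 2x3 matrix
print(A[0, :])                    # first row, A_{1,:}
print(A[:, 0])                    # first column, A_{:,1}

T = np.arange(24.0).reshape(2, 3, 4)  # a tensor with three axes
print(T[1, 2, 3])                 # one element, A_{i,j,k}
```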

One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. See figure 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix A as A^⊤, and it is defined such that

  (A^⊤)_{i,j} = A_{j,i}.   (2.3)

Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row. Sometimes we define a vector by writing out its elements in the text inline as a row matrix, then using the transpose operator to turn it into a standard column vector, e.g., a = [a_1, a_2, a_3]^⊤.
A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own transpose: a = a^⊤.

We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B where C_{i,j} = A_{i,j} + B_{i,j}.

We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of a matrix: D = a·B + c where D_{i,j} = a·B_{i,j} + c.

In the context of deep learning, we also use some less conventional notation. We allow the addition of a matrix and a vector, yielding another matrix: C = A + b, where C_{i,j} = A_{i,j} + b_j. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each row before doing the addition. This implicit copying of b to many locations is called broadcasting.
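Array libraries implement this shorthand directly. A small NumPy sketch (NumPy is our illustrative choice; its broadcasting rules generalize the matrix-plus-vector case described here):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([10.0, 20.0])

# C = A + b: b is implicitly copied ("broadcast") into each row of A,
# so C_{i,j} = A_{i,j} + b_j.
C = A + b
print(C)

# Equivalent explicit construction: stack b into a matrix first.
C_explicit = A + np.tile(b, (A.shape[0], 1))
print(np.array_equal(C, C_explicit))  # True
```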

2.2 Multiplying Matrices and Vectors

One of the most important operations involving matrices is multiplication of two matrices. The matrix product of matrices A and B is a third matrix C. In order for this product to be defined, A must have the same number of columns as B has rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p. We can write the matrix product just by placing two or more matrices together, e.g.,

  C = AB.   (2.4)

The product operation is defined by

  C_{i,j} = Σ_k A_{i,k} B_{k,j}.   (2.5)

Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise product or Hadamard product, and is denoted as A ⊙ B.

The dot product between two vectors a and b of the same dimensionality is the matrix product a^⊤b. We can think of the matrix product C = AB as computing C_{i,j} as the dot product between row i of A and column j of B.
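The distinction between the matrix product, the element-wise product, and the dot product can be made concrete with a short NumPy sketch (an illustration of the definitions above, not part of the text):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# Matrix product: C_{i,j} = sum_k A_{i,k} B_{k,j}
C = A @ B
print(C)          # [[19. 22.] [43. 50.]]

# Element-wise (Hadamard) product: a different operation.
H = A * B
print(H)          # [[ 5. 12.] [21. 32.]]

# Dot product of two vectors, the matrix product a^T b.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a @ b)      # 32.0
```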

Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive:

  A(B + C) = AB + AC.   (2.6)

It is also associative:

  A(BC) = (AB)C.   (2.7)

Matrix multiplication is not commutative (the condition AB = BA does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:

  a^⊤b = b^⊤a.   (2.8)

The transpose of a matrix product has a simple form:

  (AB)^⊤ = B^⊤A^⊤.   (2.9)

This allows us to demonstrate equation 2.8, by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose:

  a^⊤b = (a^⊤b)^⊤ = b^⊤a.   (2.10)

Since the focus of this textbook is not linear algebra, we do not attempt to develop a comprehensive list of useful properties of the matrix product here, but the reader should be aware that many more exist.
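These properties are easy to check numerically; the following NumPy sketch (with arbitrarily chosen random shapes) verifies each identity above on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((4, 2))

# Distributive: A(B + C) = AB + AC
print(np.allclose(A @ (B + C), A @ B + A @ C))   # True

# Associative: A(BC) = (AB)C (square factors so all products exist)
X = rng.standard_normal((3, 3))
Y = rng.standard_normal((3, 3))
Z = rng.standard_normal((3, 3))
print(np.allclose(X @ (Y @ Z), (X @ Y) @ Z))     # True

# Not commutative in general (random matrices almost never commute):
print(np.allclose(X @ Y, Y @ X))                 # False

# Transpose of a product: (AB)^T = B^T A^T
print(np.allclose((A @ B).T, B.T @ A.T))         # True
```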
We now know enough linear algebra notation to write down a system of linear equations:

  Aa = b   (2.11)

where A ∈ ℝ^{m×n} is a known matrix, b ∈ ℝ^m is a known vector, and a ∈ ℝ^n is a vector of unknown variables we would like to solve for. Each element a_i of a is one of these unknown variables. Each row of A and each element of b provide another constraint. We can rewrite equation 2.11 as:

  A_{1,:} a = b_1   (2.12)
  A_{2,:} a = b_2   (2.13)
  ⋮   (2.14)
  A_{m,:} a = b_m   (2.15)

or, even more explicitly, as:

  A_{1,1} a_1 + A_{1,2} a_2 + ⋯ + A_{1,n} a_n = b_1   (2.16)

  ⎡ 1  0  0 ⎤
  ⎢ 0  1  0 ⎥
  ⎣ 0  0  1 ⎦

Figure 2.2: Example identity matrix: this is I_3.

  A_{2,1} a_1 + A_{2,2} a_2 + ⋯ + A_{2,n} a_n = b_2   (2.17)
  ⋮   (2.18)
  A_{m,1} a_1 + A_{m,2} a_2 + ⋯ + A_{m,n} a_n = b_m.   (2.19)

Matrix-vector product notation provides a more compact representation for equations of this form.

2.3 Identity and Inverse Matrices

Linear algebra offers a powerful tool called matrix inversion that allows us to analytically solve equation 2.11 for many values of A.

To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as I_n. Formally, I_n ∈ ℝ^{n×n}, and

  ∀a ∈ ℝ^n, I_n a = a.   (2.20)

The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See figure 2.2 for an example.

The matrix inverse of A is denoted as A^{−1}, and it is defined as the matrix such that

  A^{−1}A = I_n.   (2.21)

We can now solve equation 2.11 by the following steps:

  Aa = b   (2.22)
  A^{−1}Aa = A^{−1}b   (2.23)
  I_n a = A^{−1}b   (2.24)

  a = A^{−1}b.   (2.25)

Of course, this process depends on it being possible to find A^{−1}. We discuss the conditions for the existence of A^{−1} in the following section.

When A^{−1} exists, several different algorithms exist for finding it in closed form. In theory, the same inverse matrix can then be used to solve the equation many times for different values of b. However, A^{−1} is primarily useful as a theoretical tool, and should not actually be used in practice for most software applications. Because A^{−1} can be represented with only limited precision on a digital computer, algorithms that make use of the value of b can usually obtain more accurate estimates of a.
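This distinction between the theoretical tool and the practical algorithm can be sketched in NumPy (the specific 2×2 system here is made up for illustration):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Theoretical route: explicitly form A^{-1}, then a = A^{-1} b.
x_inv = np.linalg.inv(A) @ b

# Preferred in practice: solve the system directly, without forming A^{-1}.
x_solve = np.linalg.solve(A, b)

print(x_solve)                        # [2. 3.]
print(np.allclose(x_inv, x_solve))    # True
print(np.allclose(A @ x_solve, b))    # True: the solution satisfies Aa = b
```

On well-conditioned small systems both routes agree, but `solve` is generally the more accurate and cheaper choice, for the precision reasons noted above.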

2.4 Linear Dependence and Span

In order for A^{−1} to exist, equation 2.11 must have exactly one solution for every value of b. However, it is also possible for the system of equations to have no solutions or infinitely many solutions for some values of b. It is not possible to have more than one but less than infinitely many solutions for a particular b; if both x and y are solutions then

  z = αx + (1 − α)y   (2.26)

is also a solution for any real α.

To analyze how many solutions the equation has, we can think of the columns of A as specifying different directions we can travel from the origin (the point specified by the vector of all zeros), and determine how many ways there are of reaching b. In this view, each element of a specifies how far we should travel in each of these directions, with a_i specifying how far to move in the direction of column i:

  Aa = Σ_i a_i A_{:,i}.   (2.27)

In general, this kind of operation is called a linear combination. Formally, a linear combination of some set of vectors {v^{(1)}, …, v^{(n)}} is given by multiplying each vector v^{(i)} by a corresponding scalar coefficient and adding the results:

  Σ_i c_i v^{(i)}.   (2.28)

The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
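The column view of the matrix-vector product in equation 2.27 can be sketched directly (a NumPy illustration with an arbitrarily chosen 3×2 matrix):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([2.0, 3.0])

# Ax as a linear combination of the columns of A:
# x_1 * (first column) + x_2 * (second column)
combo = x[0] * A[:, 0] + x[1] * A[:, 1]
print(combo)                       # [2. 3. 5.]
print(np.allclose(A @ x, combo))   # True
```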

Determining whether Aa = b has a solution thus amounts to testing whether b is in the span of the columns of A. This particular span is known as the column space or the range of A.

In order for the system Aa = b to have a solution for all values of b ∈ ℝ^m, we therefore require that the column space of A be all of ℝ^m. If any point in ℝ^m is excluded from the column space, that point is a potential value of b that has no solution. The requirement that the column space of A be all of ℝ^m implies immediately that A must have at least m columns, i.e., n ≥ m. Otherwise, the dimensionality of the column space would be less than m. For example, consider a 3 × 2 matrix. The target b is 3-D, but a is only 2-D, so modifying the value of a at best allows us to trace out a 2-D plane within ℝ³. The equation has a solution if and only if b lies on that plane.

Having n ≥ m is only a necessary condition for every point to have a solution. It is not a sufficient condition, because it is possible for some of the columns to be redundant. Consider a 2 × 2 matrix where both of the columns are identical. This has the same column space as a 2 × 1 matrix containing only one copy of the replicated column. In other words, the column space is still just a line, and fails to encompass all of ℝ², even though there are two columns.

Formally, this kind of redundancy is known as linear dependence. A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors. If we add a vector to a set that is a linear combination of the other vectors in the set, the new vector does not add any points to the set's span. This means that for the column space of the matrix to encompass all of ℝ^m, the matrix must contain at least one set of m linearly independent columns. This condition is both necessary and sufficient for equation 2.11 to have a solution for every value of b. Note that the requirement is for a set to have exactly m linearly independent columns, not at least m. No set of m-dimensional vectors can have more than m mutually linearly independent columns, but a matrix with more than m columns may have more than one such set.

In order for the matrix to have an inverse, we additionally need to ensure that equation 2.11 has at most one solution for each value of b. To do so, we need to ensure that the matrix has at most m columns. Otherwise there is more than one way of parametrizing each solution.
Together, this means that the matrix must be square, that is, we require that m = n and that all of the columns must be linearly independent. A square matrix with linearly dependent columns is known as singular.

If A is not square or is square but singular, it can still be possible to solve the equation. However, we cannot use the method of matrix inversion to find the solution.
So far we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right:

  AA^{−1} = I.   (2.29)

For square matrices, the left inverse and right inverse are equal.

2.5 Norms

Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the L^p norm is given by

  ‖a‖_p = ( Σ_i |a_i|^p )^{1/p}   (2.30)

for p ∈ ℝ, p ≥ 1.

Norms, including the L^p norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector a measures the distance from the origin to the point a. More rigorously, a norm is any function f that satisfies the following properties:

• f(a) = 0 ⇒ a = 0

• f(a + b) ≤ f(a) + f(b) (the triangle inequality)

• ∀α ∈ ℝ, f(αa) = |α| f(a)

The L^2 norm, with p = 2, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by a. The L^2 norm is used so frequently in machine learning that it is often denoted simply as ‖a‖, with the subscript 2 omitted. It is also common to measure the size of a vector using the squared L^2 norm, which can be calculated simply as a^⊤a.
The squared L^2 norm is more convenient to work with mathematically and computationally than the L^2 norm itself. For example, the derivatives of the squared L^2 norm with respect to each element of a each depend only on the corresponding element of a, while all of the derivatives of the L^2 norm depend on the entire vector. In many contexts, the squared L^2 norm may be undesirable because it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the L^1 norm. The L^1 norm may be simplified to

  ‖a‖_1 = Σ_i |a_i|.   (2.31)

The L^1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of a moves away from 0 by ε, the L^1 norm increases by ε.

We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "L^0 norm," but this is incorrect terminology. The number of non-zero entries in a vector is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L^1 norm is often used as a substitute for the number of nonzero entries.

One other norm that commonly arises in machine learning is the L^∞ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,

  ‖a‖_∞ = max_i |a_i|.   (2.32)

Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm:

  ‖A‖_F = √( Σ_{i,j} A_{i,j}² ),   (2.33)

which is analogous to the L^2 norm of a vector.

The dot product of two vectors can be rewritten in terms of norms. Specifically,

  a^⊤b = ‖a‖_2 ‖b‖_2 cos θ   (2.34)

where θ is the angle between a and b.
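All of the norms discussed here are available in NumPy; a small sketch (the concrete vector and matrix values are arbitrary illustrations):

```python
import numpy as np

a = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(a, 1))        # L^1 norm: 7.0
print(np.linalg.norm(a, 2))        # L^2 (Euclidean) norm: 5.0
print(a @ a)                       # squared L^2 norm: 25.0
print(np.linalg.norm(a, np.inf))   # max norm: 4.0
print(np.count_nonzero(a))         # nonzero count ("L^0", not a norm): 2

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.linalg.norm(A, "fro"))    # Frobenius norm: sqrt(30)
```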

2.6 Special Kinds of Matrices and Vectors

Some special kinds of matrices and vectors are particularly useful.

Diagonal matrices consist mostly of zeros and have non-zero entries only along the main diagonal. Formally, a matrix D is diagonal if and only if D_{i,j} = 0 for all i ≠ j. We have already seen one example of a diagonal matrix: the identity matrix, where all of the diagonal entries are 1. We write diag(v) to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector v. Diagonal matrices are of interest in part because multiplying by a diagonal matrix is very computationally efficient. To compute diag(v)a, we only need to scale each element a_i by v_i. In other words, diag(v)a = v ⊙ a. Inverting a square diagonal matrix is also efficient. The inverse exists only if every diagonal entry is nonzero, and in that case, diag(v)^{−1} = diag([1/v_1, …, 1/v_n]^⊤). In many cases, we may derive some very general machine learning algorithm in terms of arbitrary matrices, but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal.

Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses but it is still possible to multiply by them cheaply. For a non-square diagonal matrix D, the product Da will involve scaling each element of a, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.
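The efficiency claims for diagonal matrices can be sketched in NumPy (the dense `np.diag` products below exist only to check the cheap element-wise versions against):

```python
import numpy as np

v = np.array([2.0, 3.0, 4.0])
a = np.array([1.0, 1.0, 1.0])

# diag(v) a is just element-wise scaling, v ⊙ a: O(n) instead of O(n^2).
dense = np.diag(v) @ a
cheap = v * a
print(np.allclose(dense, cheap))   # True

# Inverting a square diagonal matrix: reciprocal of each diagonal entry.
print(np.allclose(np.linalg.inv(np.diag(v)), np.diag(1.0 / v)))  # True
```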
A symmetric matrix is any matrix that is equal to its own transpose:

  A = A^⊤.   (2.35)

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if A is a matrix of distance measurements, with A_{i,j} giving the distance from point i to point j, then A_{i,j} = A_{j,i} because distance functions are symmetric.

A unit vector is a vector with unit norm:

  ‖a‖_2 = 1.   (2.36)

A vector a and a vector b are orthogonal to each other if a^⊤b = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In ℝ^n, at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.

An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

  A^⊤A = AA^⊤ = I.   (2.37)


This implies that

  A^{−1} = A^⊤,   (2.38)

so orthogonal matrices are of interest because their inverse is very cheap to compute. Pay careful attention to the definition of orthogonal matrices. Counterintuitively, their rows are not merely orthogonal but fully orthonormal. There is no special term for a matrix whose rows or columns are orthogonal but not orthonormal.
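A rotation matrix is a standard example of an orthogonal matrix; a NumPy sketch (the angle 0.3 is arbitrary) confirming equations 2.37 and 2.38:

```python
import numpy as np

theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))       # True: columns orthonormal
print(np.allclose(Q @ Q.T, np.eye(2)))       # True: rows orthonormal
print(np.allclose(np.linalg.inv(Q), Q.T))    # True: Q^{-1} = Q^T
```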

2.7 Eigendecomposition

Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.

For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that 12 = 2 × 2 × 3. From this representation we can conclude useful properties, such as that 12 is not divisible by 5, or that any integer multiple of 12 will be divisible by 3.

Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.

One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

An eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alters only the scale of v:

  Av = λv.   (2.39)

The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that v^⊤A = λv^⊤, but we are usually concerned with right eigenvectors.)

If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ ℝ, s ≠ 0. Moreover, sv still has the same eigenvalue. For this reason, we usually only look for unit eigenvectors.
Suppose that a matrix A has n linearly independent eigenvectors {v^{(1)}, …, v^{(n)}}, with corresponding eigenvalues {λ_1, …, λ_n}. We may concatenate all of the eigenvectors to form a matrix V with one eigenvector per column: V = [v^{(1)}, …, v^{(n)}]. Likewise, we can concatenate the eigenvalues to form a vector λ = [λ_1, …, λ_n]^⊤. The eigendecomposition of A is then given by

  A = V diag(λ) V^{−1}.   (2.40)

Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v^{(1)} with eigenvalue λ_1 and v^{(2)} with eigenvalue λ_2. (Left) We plot the set of all unit vectors u ∈ ℝ² as a unit circle. (Right) We plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v^{(i)} by λ_i.
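The eigendecomposition in equation 2.40 can be computed and checked numerically; a NumPy sketch with an arbitrarily chosen symmetric 2×2 matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of V are eigenvectors; lam holds the corresponding eigenvalues.
lam, V = np.linalg.eig(A)
print(np.sort(lam))   # eigenvalues of this matrix are 1 and 3

# Reconstruct A = V diag(lam) V^{-1}.
print(np.allclose(V @ np.diag(lam) @ np.linalg.inv(V), A))  # True

# Each column satisfies A v = lam v.
for i in range(2):
    print(np.allclose(A @ V[:, i], lam[i] * V[:, i]))       # True
```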

We have seen that constructing matrices with specific eigenvalues and eigenvectors allows us to stretch space in desired directions. However, we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.
Not every matrix can be decomposed into eigenvalues and eigenvectors. In some cases, the decomposition exists, but may involve complex rather than real numbers. Fortunately, in this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

  A = QΛQ^⊤,   (2.41)

where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix. The eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q, denoted as Q_{:,i}. Because Q is an orthogonal matrix, we can think of A as scaling space by λ_i in direction v^{(i)}. See figure 2.3 for an example.

While any real symmetric matrix A is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue, and we could equivalently choose a Q using those eigenvectors instead. By convention, we usually sort the entries of Λ in descending order. Under this convention, the eigendecomposition is unique only if all of the eigenvalues are unique.

The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is singular if and only if any of the eigenvalues are zero. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form f(a) = a^⊤Aa subject to ‖a‖_2 = 1. Whenever a is equal to an eigenvector of A, f takes on the value of the corresponding eigenvalue. The maximum value of f within the constraint region is the maximum eigenvalue and its minimum value within the constraint region is the minimum eigenvalue.

A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero-valued, it is negative semidefinite. Positive semidefinite matrices are interesting because they guarantee that ∀a, a^⊤Aa ≥ 0. Positive definite matrices additionally guarantee that a^⊤Aa = 0 ⇒ a = 0.
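For the real symmetric case, a sketch using NumPy's symmetric eigensolver (the matrix B^⊤B is one standard way to construct a positive semidefinite example; the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B.T @ B          # B^T B is symmetric positive semidefinite

# eigh is specialized for symmetric matrices: real eigenvalues (ascending),
# orthogonal eigenvectors.
lam, Q = np.linalg.eigh(A)
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))   # True: A = Q Λ Q^T
print(np.allclose(Q.T @ Q, np.eye(3)))          # True: Q is orthogonal
print(np.all(lam >= -1e-10))                    # True: eigenvalues ≥ 0

# For any a, a^T A a = ||Ba||^2 ≥ 0, as positive semidefiniteness promises.
a = rng.standard_normal(3)
print(a @ A @ a >= 0)                           # True
```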

2.8 Singular Value Decomposition

In section 2.7, we saw how to decompose a matrix into eigenvectors and eigenvalues. The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition. However, the SVD is more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.

Recall that the eigendecomposition involves analyzing a matrix A to discover a matrix V of eigenvectors and a vector of eigenvalues λ such that we can rewrite A as

  A = V diag(λ) V^{−1}.   (2.42)

The singular value decomposition is similar, except this time we will write A as a product of three matrices:

  A = UDV^⊤.   (2.43)

Suppose that A is an m × n matrix. Then U is defined to be an m × m matrix, D to be an m × n matrix, and V to be an n × n matrix.

Each of these matrices is defined to have a special structure. The matrices U and V are both defined to be orthogonal matrices. The matrix D is defined to be a diagonal matrix. Note that D is not necessarily square.

The elements along the diagonal of D are known as the singular values of the matrix A. The columns of U are known as the left-singular vectors. The columns of V are known as the right-singular vectors.

We can actually interpret the singular value decomposition of A in terms of the eigendecomposition of functions of A. The left-singular vectors of A are the eigenvectors of AA^⊤. The right-singular vectors of A are the eigenvectors of A^⊤A. The non-zero singular values of A are the square roots of the eigenvalues of A^⊤A. The same is true for AA^⊤.

Perhaps the most useful feature of the SVD is that we can use it to partially generalize matrix inversion to non-square matrices, as we will see in the next section.
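A NumPy sketch of the factorization A = UDV^⊤ for a non-square matrix (the 2×3 matrix is an arbitrary example; note NumPy returns V^⊤ directly and the singular values as a vector, so the rectangular D must be rebuilt by hand):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # a non-square 2x3 matrix

U, s, Vt = np.linalg.svd(A)       # s holds the singular values
print(U.shape, s.shape, Vt.shape) # (2, 2) (2,) (3, 3)

# Rebuild the rectangular diagonal D and reconstruct A = U D V^T.
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)
print(np.allclose(U @ D @ Vt, A))                     # True

# Nonzero singular values are square roots of eigenvalues of A^T A.
lam = np.linalg.eigvalsh(A.T @ A)  # ascending; smallest is ~0 here
print(np.allclose(np.sort(s**2), np.sort(lam)[-2:]))  # True
```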

2.9 The Moore-Penrose Pseudoinverse

Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse B of a matrix A, so that we can solve a linear equation

  Aa = b   (2.44)

by left-multiplying each side to obtain

  a = Bb.   (2.45)

Depending on the structure of the problem, it may not be possible to design a unique mapping from A to B.

If A is taller than it is wide, then it is possible for this equation to have no solution. If A is wider than it is tall, then there could be multiple possible solutions.

The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of A is defined as a matrix

  A⁺ = lim_{α→0⁺} (A^⊤A + αI)^{−1} A^⊤.   (2.46)

Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula

  A⁺ = VD⁺U^⊤,   (2.47)

where U, D and V are the singular value decomposition of A, and the pseudoinverse D⁺ of a diagonal matrix D is obtained by taking the reciprocal of its non-zero elements then taking the transpose of the resulting matrix.

When A has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution a = A⁺b with minimal Euclidean norm ‖a‖_2 among all possible solutions.

When A has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the a for which Aa is as close as possible to b in terms of Euclidean norm ‖Aa − b‖_2.
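Both behaviors can be sketched in NumPy (the two small systems below are made-up illustrations; `lstsq` is included only as an independent check of the least-squares claim):

```python
import numpy as np

# Overdetermined: more rows than columns, typically no exact solution.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

x = np.linalg.pinv(A) @ b
# x minimizes ||Ax - b||_2; it agrees with the least-squares solver.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ls))      # True

# Underdetermined: pinv picks the minimum-norm solution.
A2 = np.array([[1.0, 1.0]])
b2 = np.array([2.0])
x2 = np.linalg.pinv(A2) @ b2
print(x2)                        # [1. 1.]: smallest ||x||_2 with x1 + x2 = 2
print(np.allclose(A2 @ x2, b2))  # True
```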

T he trace operator givesthe sum ofallofthe diagonalentries ofa m atrix:


b
Wu)) N A ) O.O( (M.2a)
O

The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

||A||_F = sqrt(Tr(A A^T)). (2.49)

Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:

Tr(A) = Tr(A^T). (2.50)

The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

Tr(ABC) = Tr(CAB) = Tr(BCA), (2.51)

or more generally,

Tr( ∏_{i=1}^{n} F^{(i)} ) = Tr( F^{(n)} ∏_{i=1}^{n-1} F^{(i)} ). (2.52)

This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ ℝ^{m×n} and B ∈ ℝ^{n×m}, we have

Tr(AB) = Tr(BA) (2.53)

even though AB ∈ ℝ^{m×m} and BA ∈ ℝ^{n×n}.


Another useful fact to keep in mind is that a scalar is its own trace: a = Tr(a).
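These identities are easy to verify numerically. The sketch below, using made-up random matrices, checks the Frobenius-norm identity (equation 2.49) and the cyclic-permutation property with a shape change (equation 2.53):

```python
import numpy as np

# Made-up matrices of different shapes.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))   # A in R^{m x n}
B = rng.standard_normal((6, 4))   # B in R^{n x m}

# ||A||_F = sqrt(Tr(A A^T)), equation 2.49.
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt(np.trace(A @ A.T)))

# Tr(AB) = Tr(BA), even though AB is 4x4 and BA is 6x6 (equation 2.53).
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```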

2.11 The Determinant

The determinant of a square matrix, denoted det(A), is a function mapping matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation preserves volume.
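A small numerical illustration of these properties, with matrices made up for the example: a diagonal scaling has determinant equal to the product of its eigenvalues and to its area-scaling factor, while a rotation has determinant 1 and preserves volume.

```python
import numpy as np

# Diagonal scaling: stretches x by 2 and y by 3, so a unit square maps
# to a rectangle of area 6.
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
assert np.isclose(np.linalg.det(A), 6.0)
# det equals the product of the eigenvalues.
assert np.isclose(np.linalg.det(A), np.prod(np.linalg.eigvals(A)).real)

# A rotation has determinant 1: it preserves volume.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.isclose(np.linalg.det(R), 1.0)
```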


2.12 Example: Principal Components Analysis

One simple machine learning algorithm, principal components analysis or PCA, can be derived using only knowledge of basic linear algebra.

Suppose we have a collection of m points {x^{(1)}, ..., x^{(m)}} in ℝ^n. Suppose we would like to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We would like to lose as little precision as possible.
One way we can encode these points is to represent a lower-dimensional version of them. For each point x^{(i)} ∈ ℝ^n we will find a corresponding code vector c^{(i)} ∈ ℝ^l. If l is smaller than n, it will take less memory to store the code points than the original data. We will want to find some encoding function that produces the code for an input, f(x) = c, and a decoding function that produces the reconstructed input given its code, x ≈ g(f(x)).
PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back into ℝ^n. Let g(c) = Dc, where D ∈ ℝ^{n×l} is the matrix defining the decoding.

Computing the optimal code for this decoder could be a difficult problem. To keep the encoding problem easy, PCA constrains the columns of D to be orthogonal to each other. (Note that D is still not technically "an orthogonal matrix" unless l = n.)
With the problem as described so far, many solutions are possible, because we can increase the scale of D_{:,i} if we decrease c_i proportionally for all points. To give the problem a unique solution, we constrain all of the columns of D to have unit norm.
In order to turn this basic idea into an algorithm we can implement, the first thing we need to do is figure out how to generate the optimal code point c* for each input point x. One way to do this is to minimize the distance between the input point x and its reconstruction, g(c*). We can measure this distance using a norm. In the principal components algorithm, we use the L^2 norm:

c* = arg min_c ||x - g(c)||_2. (2.54)

We can switch to the squared L^2 norm instead of the L^2 norm itself, because both are minimized by the same value of c. Both are minimized by the same value of c because the L^2 norm is non-negative and the squaring operation is monotonically increasing for non-negative arguments:

c* = arg min_c ||x - g(c)||_2^2. (2.55)

The function being minimized simplifies to

(x - g(c))^T (x - g(c)) (2.56)

(by the definition of the L^2 norm, equation 2.30)

= x^T x - x^T g(c) - g(c)^T x + g(c)^T g(c) (2.57)

(by the distributive property)

= x^T x - 2 x^T g(c) + g(c)^T g(c) (2.58)

(because the scalar g(c)^T x is equal to the transpose of itself).


We can now change the function being minimized again, to omit the first term, since this term does not depend on c:

c* = arg min_c -2 x^T g(c) + g(c)^T g(c). (2.59)

To make further progress, we must substitute in the definition of g(c):

c* = arg min_c -2 x^T D c + c^T D^T D c (2.60)

= arg min_c -2 x^T D c + c^T I_l c (2.61)

(by the orthogonality and unit norm constraints on D)

= arg min_c -2 x^T D c + c^T c. (2.62)

We can solve this optimization problem using vector calculus (see section 4.3 if you do not know how to do this):

∇_c (-2 x^T D c + c^T c) = 0 (2.63)

-2 D^T x + 2c = 0 (2.64)

c = D^T x. (2.65)


This makes the algorithm efficient: we can optimally encode x just using a matrix-vector operation. To encode a vector, we apply the encoder function

f(x) = D^T x. (2.66)

Using a further matrix multiplication, we can also define the PCA reconstruction operation:

r(x) = g(f(x)) = D D^T x. (2.67)
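The encoder and reconstruction defined above can be sketched directly in code. In this hypothetical example, D is built with l orthonormal columns via a QR factorization; any orthonormal basis serves to illustrate the mechanics, even though it is not yet the optimal D.

```python
import numpy as np

# f(x) = D^T x encodes, g(c) = D c decodes, r(x) = D D^T x reconstructs.
rng = np.random.default_rng(2)
n, l = 5, 2
# QR factorization of a random matrix gives l orthonormal columns for D.
D, _ = np.linalg.qr(rng.standard_normal((n, l)))
assert np.allclose(D.T @ D, np.eye(l))       # columns orthonormal: D^T D = I_l

x = rng.standard_normal(n)
c = D.T @ x                                  # encode: c = f(x)
x_hat = D @ c                                # reconstruct: r(x) = D D^T x

# The reconstruction is the projection of x onto the column space of D,
# so re-encoding it gives back the same code.
assert np.allclose(D.T @ x_hat, c)
```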

Next, we need to choose the encoding matrix D. To do so, we revisit the idea of minimizing the L^2 distance between inputs and reconstructions. Since we will use the same matrix D to decode all of the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:

D* = arg min_D sqrt( Σ_{i,j} ( x_j^{(i)} - r(x^{(i)})_j )^2 )   subject to D^T D = I_l. (2.68)

To derive the algorithm for finding D*, we will start by considering the case where l = 1. In this case, D is just a single vector, d. Substituting equation 2.67 into equation 2.68 and simplifying D into d, the problem reduces to

d* = arg min_d Σ_i ||x^{(i)} - d d^T x^{(i)}||_2^2   subject to ||d||_2 = 1. (2.69)

The above formulation is the most direct way of performing the substitution, but is not the most stylistically pleasing way to write the equation. It places the scalar value d^T x^{(i)} on the right of the vector d. It is more conventional to write scalar coefficients on the left of the vector they operate on. We therefore usually write such a formula as

d* = arg min_d Σ_i ||x^{(i)} - d^T x^{(i)} d||_2^2   subject to ||d||_2 = 1, (2.70)

or, exploiting the fact that a scalar is its own transpose, as

d* = arg min_d Σ_i ||x^{(i)} - x^{(i)T} d d||_2^2   subject to ||d||_2 = 1. (2.71)

The reader should aim to become familiar with such cosmetic rearrangements.


At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. This will allow us to use more compact notation. Let X ∈ ℝ^{m×n} be the matrix defined by stacking all of the vectors describing the points, such that X_{i,:} = (x^{(i)})^T. We can now rewrite the problem as

d* = arg min_d ||X - X d d^T||_F^2   subject to d^T d = 1. (2.72)

Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:

arg min_d ||X - X d d^T||_F^2 (2.73)

= arg min_d Tr( (X - X d d^T)^T (X - X d d^T) ) (2.74)

(by equation 2.49)

= arg min_d Tr( X^T X - X^T X d d^T - d d^T X^T X + d d^T X^T X d d^T ) (2.75)

= arg min_d Tr(X^T X) - Tr(X^T X d d^T) - Tr(d d^T X^T X) + Tr(d d^T X^T X d d^T) (2.76)

= arg min_d -Tr(X^T X d d^T) - Tr(d d^T X^T X) + Tr(d d^T X^T X d d^T) (2.77)

(because terms not involving d do not affect the arg min)

= arg min_d -2 Tr(X^T X d d^T) + Tr(d d^T X^T X d d^T) (2.78)

(because we can cycle the order of the matrices inside a trace, equation 2.52)

= arg min_d -2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T) (2.79)

(using the same property again)


At this point, we re-introduce the constraint:

arg min_d -2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T)   subject to d^T d = 1 (2.80)

= arg min_d -2 Tr(X^T X d d^T) + Tr(X^T X d d^T)   subject to d^T d = 1 (2.81)

(due to the constraint)

= arg min_d -Tr(X^T X d d^T)   subject to d^T d = 1 (2.82)

= arg max_d Tr(X^T X d d^T)   subject to d^T d = 1 (2.83)

= arg max_d Tr(d^T X^T X d)   subject to d^T d = 1. (2.84)

This optimization problem may be solved using eigendecomposition. Specifically, the optimal d is given by the eigenvector of X^T X corresponding to the largest eigenvalue.

This derivation is specific to the case of l = 1 and recovers only the first principal component. More generally, when we wish to recover a basis of principal components, the matrix D is given by the l eigenvectors corresponding to the largest eigenvalues. This may be shown using proof by induction. We recommend writing this proof as an exercise.
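Putting the whole derivation together, here is a minimal sketch of PCA on made-up data: D is taken to be the l eigenvectors of X^T X with the largest eigenvalues, and its reconstruction error is compared against a random orthonormal basis, which should do no better.

```python
import numpy as np

# Made-up correlated data matrix X in R^{m x n}.
rng = np.random.default_rng(3)
m, n, l = 200, 5, 2
X = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))

# Eigendecomposition of X^T X (symmetric, so eigh; eigenvalues ascending).
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
D = eigvecs[:, ::-1][:, :l]        # l eigenvectors with largest eigenvalues

C = X @ D                          # codes: each row is c = D^T x
X_hat = C @ D.T                    # reconstructions r(x) = D D^T x

# No other orthonormal D of the same size achieves lower reconstruction
# error; a random orthonormal basis should do no better.
err_pca = np.linalg.norm(X - X_hat, "fro")
Q, _ = np.linalg.qr(rng.standard_normal((n, l)))
err_rand = np.linalg.norm(X - X @ Q @ Q.T, "fro")
assert err_pca <= err_rand + 1e-9
```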
Linear algebra is one of the fundamental mathematical disciplines that is necessary to understand deep learning. Another key area of mathematics that is ubiquitous in machine learning is probability theory, presented next.
