Linear Algebra: Lecture Slides for Chapter 2 of Deep Learning, Ian Goodfellow, 2016-06-24
(Goodfellow 2016)
Scalars
• A scalar is a single number
a, n, x
Vectors
• A vector is a 1-D array of numbers arranged in order. We can identify each individual number by its index in that ordering.
• We write vectors in lowercase bold typeface, with elements identified by a subscript: the first element of x is x_1, the second is x_2, and so on. We write a vector as a column enclosed in square brackets:

    x = [x_1, x_2, …, x_n]^T  (2.1)

Matrices
• A matrix is a 2-D array of numbers, so each element is identified by two indices: A_{i,j} is the entry in row i and column j of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

    [ A_{1,1}  A_{1,2} ]
    [ A_{2,1}  A_{2,2} ]  (2.2)

• A tensor is an array of numbers arranged on a regular grid with three or more dimensions.
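These objects map directly onto NumPy arrays; a minimal sketch with illustrative values (not from the slides):

```python
import numpy as np

# A vector is a 1-D array; elements are identified by index
# (x_1 in the slides is x[0] in 0-based indexing).
x = np.array([1.0, 2.0, 3.0])
first = x[0]

# A matrix is a 2-D array; A[i, j] is the element in row i, column j.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# A tensor is an array with three or more dimensions.
T = np.zeros((2, 3, 4))

print(x.ndim, A.ndim, T.ndim)  # 1 2 3
```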
Matrix Transpose
• An important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. See Figure 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix A as A^T, and it is defined such that

    (A^T)_{i,j} = A_{j,i}.

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.

• Matrix operations have many useful properties that make mathematical analysis more convenient. For example, matrix multiplication is distributive:

    A(B + C) = AB + AC.  (2.6)

• The dot product between two vectors is commutative:

    x^T y = y^T x.  (2.8)

One can demonstrate Eq. 2.8 by exploiting the fact that the value of such a product is a scalar and is therefore equal to its own transpose.
• The transpose of a matrix product has a simple form:

    (AB)^T = B^T A^T.  (2.9)
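The transpose identities above are easy to check numerically; a quick NumPy sketch (random matrices, values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

# The transpose mirrors the matrix across its main diagonal: (A^T)_{i,j} = A_{j,i}.
assert A.T[1, 2] == A[2, 1]

# Eq. 2.9: the transpose of a matrix product has a simple form.
assert np.allclose((A @ B).T, B.T @ A.T)

# Eq. 2.8: the dot product is commutative, x^T y = y^T x.
x, y = rng.standard_normal(5), rng.standard_normal(5)
assert np.isclose(x @ y, y @ x)
```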
Matrix (Dot) Product
• In order for the product C = AB to be defined, A must have the same number of columns as B has rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p; the inner dimensions (n) must match.
• We can write the matrix product just by placing two or more matrices together:

    C = AB.  (2.4)

• The product operation is defined by

    C_{i,j} = Σ_k A_{i,k} B_{k,j}.  (2.5)

• Note that the standard product of two matrices is not just a matrix containing the products of the individual elements. Such an operation exists and is called the element-wise or Hadamard product, and is denoted as A ⊙ B.
• The dot product between two vectors x and y of the same dimensionality is the matrix product x^T y. We can think of the matrix product C = AB as computing each C_{i,j} as the dot product between row i of A and column j of B.
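The distinction between the matrix product (Eq. 2.5) and the Hadamard product can be sketched in NumPy (matrix values are illustrative):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])      # shape (2, 3): m=2, n=3
B = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])          # shape (3, 2): n=3, p=2

# Matrix product (Eq. 2.5): C[i, j] = sum_k A[i, k] * B[k, j]; result is (m, p).
C = A @ B
assert C.shape == (2, 2)

# C[0, 0] is the dot product of row 0 of A with column 0 of B.
assert np.isclose(C[0, 0], A[0] @ B[:, 0])

# Element-wise (Hadamard) product requires matching shapes; here A ⊙ A.
H = A * A
assert H.shape == A.shape
```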
Identity Matrix
• Matrix inversion is a powerful tool that allows us to analytically solve Ax = b for many values of A.
• To describe matrix inversion, we first need to define the concept of an identity matrix: a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as I_n. Formally, I_n ∈ R^{n×n}, and

    ∀x ∈ R^n, I_n x = x.  (2.20)

• The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.

Figure 2.2: Example identity matrix: this is I_3.

    [ 1 0 0 ]
    [ 0 1 0 ]
    [ 0 0 1 ]
Systems of Equations
• We have listed only a few useful properties of the matrix product here, but be aware that many more exist.
• We now have enough linear algebra notation to write down a system of linear equations:

    Ax = b  (2.11)

where A ∈ R^{m×n} is a known matrix, b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one of the unknown variables, and each row of A together with the corresponding element of b provides another constraint.
• Eq. 2.11 expands to:

    A_{1,1} x_1 + A_{1,2} x_2 + … + A_{1,n} x_n = b_1
    A_{2,1} x_1 + A_{2,2} x_2 + … + A_{2,n} x_n = b_2
    …
    A_{m,1} x_1 + A_{m,2} x_2 + … + A_{m,n} x_n = b_m

Matrix-vector product notation provides a more compact representation of equations of this form.
• The system may have:
  • No solution
  • Many solutions
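When a unique solution exists, a solver can find it directly; a minimal NumPy sketch with a small illustrative system:

```python
import numpy as np

# A small system Ax = b with a unique solution (values chosen for illustration).
A = np.array([[3., 1.],
              [1., 2.]])
b = np.array([9., 8.])

# Solve Ax = b without explicitly forming an inverse.
x = np.linalg.solve(A, b)  # x is [2., 3.]

assert np.allclose(A @ x, b)
```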
Matrix Inversion
• The matrix inverse of A is denoted as A^{-1}, and it is defined as the matrix such that

    A^{-1} A = I_n.  (2.21)

• Solving a system using an inverse: we can solve Eq. 2.11 by the following steps:

    Ax = b  (2.22)
    A^{-1} A x = A^{-1} b  (2.23)
    I_n x = A^{-1} b  (2.24)
    x = A^{-1} b

• Numerically unstable, but useful for abstract analysis.
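The "numerically unstable, but useful for abstract analysis" point can be sketched in NumPy: the inverse-based route from Eqs. 2.22–2.24 works, but in practice a solver is preferred (values illustrative):

```python
import numpy as np

A = np.array([[3., 1.],
              [1., 2.]])
b = np.array([9., 8.])

# Abstract route (Eqs. 2.22-2.24): x = A^{-1} b.
x_inv = np.linalg.inv(A) @ b

# Practical route: a direct solver, which avoids explicitly forming A^{-1}
# and is generally more accurate on ill-conditioned systems.
x_solve = np.linalg.solve(A, b)

assert np.allclose(x_inv, x_solve)
```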
Invertibility
• Matrix can’t be inverted if…
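One such case is a matrix with linearly dependent rows (a fact not spelled out on this slide): such a matrix is singular and has no inverse. A small NumPy sketch with an illustrative singular matrix:

```python
import numpy as np

# Row 1 is 2 * row 0, so the rows are linearly dependent and S is singular.
S = np.array([[1., 2.],
              [2., 4.]])

# A singular matrix has no inverse; NumPy raises LinAlgError.
try:
    np.linalg.inv(S)
    inverted = True
except np.linalg.LinAlgError:
    inverted = False

assert not inverted
assert np.isclose(np.linalg.det(S), 0.0)
```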
Norms
• Functions that measure how "large" a vector is
• Similar to a distance between zero and the point represented by the vector
• The L^p norm is given by

    ||x||_p = ( Σ_i |x_i|^p )^{1/p}

for p ∈ R, p ≥ 1.
• Norms, including the L^p norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:
  • f(x) = 0 ⇒ x = 0
  • f(x + y) ≤ f(x) + f(y) (the triangle inequality)
  • ∀α ∈ R, f(αx) = |α| f(x)
• Most popular norm: the L^2 norm, p = 2.
• The squared L^2 norm increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the L^1 norm.
• L^1 norm, p = 1:

    ||x||_1 = Σ_i |x_i|.  (2.31)

The L^1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important: every time an element of x moves away from 0 by ε, the L^1 norm increases by ε.
• We sometimes measure the size of a vector by counting its number of nonzero elements. Some authors refer to this function as the "L^0 norm," but this is incorrect terminology: the number of nonzero entries in a vector is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L^1 norm is often used as a substitute for the number of nonzero entries.
• Max norm, infinite p:

    ||x||_∞ = max_i |x_i|.  (2.32)

This norm, also known as the L^∞ norm, simplifies to the absolute value of the element with the largest magnitude in the vector.
• Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm:

    ||A||_F = sqrt( Σ_{i,j} A_{i,j}^2 )
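Each of these norms is a one-liner in NumPy; a quick sketch with illustrative values:

```python
import numpy as np

x = np.array([3., -4., 0.])

l2 = np.linalg.norm(x)            # L^2 norm: sqrt(9 + 16) = 5
l1 = np.linalg.norm(x, 1)         # L^1 norm: |3| + |-4| + |0| = 7
linf = np.linalg.norm(x, np.inf)  # max norm: largest magnitude = 4
nnz = np.count_nonzero(x)         # count of nonzeros ("L^0") -- not a norm

assert np.isclose(l2, 5.0) and np.isclose(l1, 7.0) and np.isclose(linf, 4.0)

# Frobenius norm of a matrix: sqrt of the sum of squared entries.
A = np.array([[1., 2.],
              [3., 4.]])
fro = np.linalg.norm(A)  # sqrt(1 + 4 + 9 + 16) = sqrt(30)
assert np.isclose(fro, np.sqrt(30.0))
```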
Special Matrices and Vectors
• Unit vector: a vector with unit norm:

    ||x||_2 = 1.  (2.36)

• A vector x and a vector y are orthogonal to each other if x^T y = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In R^n, at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
• Symmetric matrix: any matrix that is equal to its own transpose:

    A = A^T.  (2.35)

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, in distance measurements, with A_{i,j} giving the distance from point i to point j, we have A_{i,j} = A_{j,i} because distance functions are symmetric.
• Not all diagonal matrices need be square; it is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses, but it is still possible to multiply by them cheaply. For a non-square diagonal matrix D, the product Dx involves scaling each element of x, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.
• Orthogonal matrix: a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

    A^T A = A A^T = I,  (2.37)

which implies

    A^{-1} = A^T.  (2.38)
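Eqs. 2.37–2.38 can be checked on a concrete orthogonal matrix; a sketch using a 2-D rotation (the angle is illustrative):

```python
import numpy as np

theta = 0.3
# A 2-D rotation matrix is orthogonal: its rows and columns are orthonormal.
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Eq. 2.37: Q^T Q = Q Q^T = I.
assert np.allclose(Q.T @ Q, np.eye(2))
assert np.allclose(Q @ Q.T, np.eye(2))

# Eq. 2.38: the inverse is just the transpose.
assert np.allclose(np.linalg.inv(Q), Q.T)
```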
Eigendecomposition
• One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
• Eigenvector and eigenvalue: an eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alters only the scale of v:

    Av = λv.  (2.39)

The scalar λ is known as the eigenvalue corresponding to this eigenvector. If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ R, s ≠ 0, and sv has the same eigenvalue. For this reason, we usually look only for unit eigenvectors.
• Suppose A has n linearly independent eigenvectors. We can form a matrix V with one eigenvector per column, V = [v^(1), …, v^(n)], and concatenate the eigenvalues to form a vector λ = [λ_1, …, λ_n]^T.
• Eigendecomposition of a diagonalizable matrix:

    A = V diag(λ) V^{-1}.  (2.40)

A decomposition may exist but involve complex rather than real numbers.
• In this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

    A = Q Λ Q^T,  (2.41)

where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix; the eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q.
• Decomposing matrices into their eigenvalues and eigenvectors can help us analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.

Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v^(1) with eigenvalue λ_1 and v^(2) with eigenvalue λ_2. (Left) We plot the set of all unit vectors u ∈ R^2 as a unit circle. (Right) We plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v^(i) by λ_i.
Effect of Eigenvalues

[Figure: two panels, "Before multiplication" and "After multiplication," on x_1 axes spanning −3 to 3. The left panel shows the unit eigenvector v^(1); the right panel shows it stretched to λ_1 v^(1) after multiplication, illustrating that A scales space by λ_i along each eigenvector v^(i).]
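The eigendecomposition of a real symmetric matrix (Eqs. 2.39–2.41) can be sketched in NumPy; the matrix below is illustrative:

```python
import numpy as np

# A real symmetric matrix, so Eq. 2.41 applies with an orthogonal Q.
A = np.array([[2., 1.],
              [1., 2.]])

# eigh is specialized for symmetric matrices: eigenvalues in ascending
# order, eigenvectors as the columns of Q.
lam, Q = np.linalg.eigh(A)

# Each column of Q satisfies Av = lambda * v (Eq. 2.39).
for i in range(2):
    assert np.allclose(A @ Q[:, i], lam[i] * Q[:, i])

# Reconstruct A = Q diag(lambda) Q^T (Eq. 2.41).
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)
```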
Singular Value Decomposition
• Similar to eigendecomposition, A = V diag(λ) V^{-1} (2.42), but more general: the matrix need not be square.

    A = U D V^T.  (2.43)

Moore-Penrose Pseudoinverse
• The pseudoinverse of A is

    A^+ = V D^+ U^T,  (2.47)

where U, D, and V are the singular value decomposition of A.
• The pseudoinverse provides one of the many possible tools for solving linear equations that lack an exact or unique solution. If the equation has:
  • Exactly one solution: this is the same as the inverse.
  • No solution (A has more rows than columns): this gives us the x for which Ax is as close as possible to y in terms of the Euclidean norm ||Ax − y||_2.
  • Many solutions (A has more columns than rows): this gives us the solution x = A^+ y with minimal Euclidean norm ||x||_2 among all possible solutions.
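Both pseudoinverse cases can be sketched in NumPy with random illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Wide A (more columns than rows): many solutions exist for full-rank A;
# pinv picks the one with minimal Euclidean norm.
A = rng.standard_normal((2, 4))
y = rng.standard_normal(2)
x = np.linalg.pinv(A) @ y
assert np.allclose(A @ x, y)  # an exact solution, since A has full row rank

# Tall A (more rows than columns): typically no exact solution; pinv gives
# the least-squares x minimizing ||Ax - y||_2, matching lstsq.
A2 = rng.standard_normal((4, 2))
y2 = rng.standard_normal(4)
x2 = np.linalg.pinv(A2) @ y2
x_ls, *_ = np.linalg.lstsq(A2, y2, rcond=None)
assert np.allclose(x2, x_ls)
```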