
Linear Algebra

Lecture slides for Chapter 2 of Deep Learning


Ian Goodfellow
2016-06-24
About this chapter

• Not a comprehensive survey of all of linear algebra

• Focused on the subset most relevant to deep learning

• Larger subset: e.g., Linear Algebra by Georgi Shilov

(Goodfellow 2016)
Scalars
• A scalar is a single number

• Integers, real numbers, rational numbers, etc.

• We denote it with italic font:

a, n, x

Vectors

• A vector is a 1-D array of numbers:

    x = [x1, x2, …, xn]^T                         (2.1)

• Can be real, binary, integer, etc.

• Example notation for type and size: x ∈ R^n
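The slides use mathematical notation only; as a hedged sketch, the vector and set-indexing conventions above might look like this in NumPy (array values are illustrative, and note that the book's indices are 1-based while NumPy's are 0-based):

```python
import numpy as np

# A vector x in R^6, represented as a 1-D NumPy array.
x = np.array([4.0, 7.0, 1.0, 9.0, 2.0, 5.0])

# Indexing a set of elements: the book's S = {1, 3, 6} (1-based)
# corresponds to the 0-based indices {0, 2, 5} in NumPy.
S = [0, 2, 5]
x_S = x[S]                   # the elements x1, x3, x6
x_not_S = np.delete(x, S)    # the complement: all elements except x1, x3, x6
```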
Matrices

• A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one:

    Row →   [ A1,1  A1,2 ]
            [ A2,1  A2,2 ]                        (2.2)
               ↑
             Column

• Example notation for type and shape: A ∈ R^{m×n}
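A minimal NumPy sketch of the matrix indexing and shape notation above (the matrix entries are illustrative):

```python
import numpy as np

# A real-valued matrix A of height m = 3 and width n = 2, so A lies in R^{3x2}.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

m, n = A.shape        # height (number of rows) and width (number of columns)
row_0 = A[0, :]       # a row: A_{1,:} in the book's 1-based notation
col_1 = A[:, 1]       # a column: A_{:,2} in 1-based notation
element = A[2, 1]     # the single element A_{3,2} in 1-based notation
```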
Tensors

• A tensor is an array of numbers that may have

• zero dimensions, and be a scalar

• one dimension, and be a vector

• two dimensions, and be a matrix

• or more dimensions.
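The hierarchy above can be sketched with NumPy arrays, whose `ndim` attribute counts the number of axes (the shapes chosen here are arbitrary):

```python
import numpy as np

scalar = np.array(3.5)            # 0 dimensions: a scalar
vector = np.array([1.0, 2.0])     # 1 dimension: a vector
matrix = np.ones((2, 3))          # 2 dimensions: a matrix
tensor = np.zeros((2, 3, 4))      # 3 dimensions: a higher-order tensor

# The element of the 3-D tensor at coordinates (i, j, k) is tensor[i, j, k].
dims = (scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim)
```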
Matrix Transpose

• An important operation on matrices is the transpose: the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. We denote the transpose of a matrix A as A^T, and it is defined such that

    (A^T)_{i,j} = A_{j,i}                         (2.3)

        [ A1,1  A1,2 ]
    A = [ A2,1  A2,2 ]   ⇒   A^T = [ A1,1  A2,1  A3,1 ]
        [ A3,1  A3,2 ]             [ A1,2  A2,2  A3,2 ]

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.

• Matrix operations have many useful properties that make mathematical analysis more convenient. For example, matrix multiplication is distributive and associative:

    A(B + C) = AB + AC                            (2.6)

    A(BC) = (AB)C                                 (2.7)

• Matrix multiplication is not commutative (the condition AB = BA does not always hold). However, the dot product between two vectors is commutative:

    x^T y = y^T x                                 (2.8)

• The transpose of a matrix product has a simple form:

    (AB)^T = B^T A^T                              (2.9)
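The transpose identities above can be checked numerically; this hedged sketch uses random matrices (the shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 4))
x = rng.standard_normal(5)
y = rng.standard_normal(5)

# (A^T)_{i,j} = A_{j,i}   (Eq. 2.3)
assert A.T[1, 2] == A[2, 1]

# (AB)^T = B^T A^T        (Eq. 2.9)
lhs = (A @ B).T
rhs = B.T @ A.T

# x^T y = y^T x           (Eq. 2.8): the vector dot product is commutative
dot_xy = x @ y
dot_yx = y @ x
```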
Matrix (Dot) Product

• In order for the product C = AB of matrices A and B to be defined, A must have the same number of columns as B has rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p:

    C = AB                                        (2.4)

    (m × p) = (m × n)(n × p)   ← the inner dimension n must match

• The product operation is defined by

    C_{i,j} = Σ_k A_{i,k} B_{k,j}                 (2.5)

• The dot product between two vectors x and y of the same dimensionality is x^T y. We can think of the matrix product C = AB as computing each C_{i,j} as the dot product between row i of A and column j of B.

• The standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise or Hadamard product, and is denoted as A ⊙ B.
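A hedged sketch contrasting the matrix product with the Hadamard product in NumPy (the entries are illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])       # shape (3, 2): m x n
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])  # shape (2, 3): n x p

C = A @ B                        # matrix product, shape (3, 3): m x p

# C_{i,j} is the dot product of row i of A with column j of B (Eq. 2.5).
c_00 = A[0, :] @ B[:, 0]

# The Hadamard (element-wise) product requires operands of equal shape.
H = A * A
```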
Identity Matrix

• Matrix inversion is a powerful tool that allows us to solve Ax = b for many values of A. To describe it, we first need to define the concept of an identity matrix: a matrix that does not change any vector when we multiply the vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as I_n ∈ R^{n×n}. Formally,

    ∀x ∈ R^n, I_n x = x                           (2.20)

         [ 1  0  0 ]
    I3 = [ 0  1  0 ]
         [ 0  0  1 ]

Figure 2.2: Example identity matrix: This is I3.

• The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.
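A minimal sketch of Eq. 2.20 in NumPy (the vector is illustrative):

```python
import numpy as np

I3 = np.eye(3)                   # the identity matrix I_3 of Fig. 2.2
x = np.array([2.0, -1.0, 7.0])

# Multiplying by the identity leaves any vector unchanged (Eq. 2.20).
same = I3 @ x
```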
Systems of Equations

• We now have enough linear algebra notation to write down a system of linear equations:

    Ax = b                                        (2.11)

  where A ∈ R^{m×n} is a known matrix, b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one unknown variable. Each row of A and each element of b provide another constraint.

• Eq. 2.11 expands to

    A_{1,:} x = b_1                               (2.12)
    A_{2,:} x = b_2                               (2.13)
    …                                             (2.14)
    A_{m,:} x = b_m                               (2.15)

  or, more explicitly,

    A_{1,1} x_1 + A_{1,2} x_2 + … + A_{1,n} x_n = b_1
    …
    A_{m,1} x_1 + A_{m,2} x_2 + … + A_{m,n} x_n = b_m

• Matrix-vector product notation provides a more compact representation of equations of this form.
Solving Systems of Equations

• A linear system of equations can have:

• No solution

• Many solutions

• Exactly one solution: this means multiplication by the matrix is an invertible function
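For the "exactly one solution" case, a hedged NumPy sketch of solving Ax = b directly (the system here is illustrative):

```python
import numpy as np

# A square, invertible system Ax = b with exactly one solution.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)   # the unique solution vector

# Substituting the solution back in recovers b.
recovered = A @ x
```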
Matrix Inversion

• Matrix inverse: the inverse of A is denoted as A^{-1}, and it is defined as the matrix such that

    A^{-1} A = I_n                                (2.21)

• Solving a system using an inverse: we can solve Eq. 2.11 by the following steps:

    Ax = b                                        (2.22)
    A^{-1} Ax = A^{-1} b                          (2.23)
    I_n x = A^{-1} b                              (2.24)

• Numerically unstable, but useful for abstract analysis
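A hedged sketch of Eqs. 2.21–2.24 in NumPy; as the slide notes, explicit inversion is less numerically stable than a solver, so the comparison below is for illustration only:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
b = np.array([1.0, 0.0])

A_inv = np.linalg.inv(A)

# A^{-1} A = I_n   (Eq. 2.21)
should_be_identity = A_inv @ A

# x = A^{-1} b (Eqs. 2.22-2.24). In practice np.linalg.solve(A, b) is
# preferred over forming the inverse explicitly.
x_via_inverse = A_inv @ b
x_via_solve = np.linalg.solve(A, b)
```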
Invertibility

• Matrix can’t be inverted if…

• More rows than columns

• More columns than rows

• Redundant rows/columns (“linearly dependent”, “low rank”)
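The "redundant rows/columns" case can be sketched numerically: a rank check exposes linear dependence (the matrix below, with its second row twice the first, is an illustrative example):

```python
import numpy as np

# A square matrix with a redundant row (row 2 = 2 * row 1):
# linearly dependent, "low rank", and therefore not invertible.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

rank = np.linalg.matrix_rank(A)   # 1, less than the dimension n = 2
det = np.linalg.det(A)            # 0 for a singular matrix
```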
Norms

• Functions that measure how “large” a vector is

• Similar to a distance between zero and the point represented by the vector

• The L^p norm is given by

    ||x||_p = ( Σ_i |x_i|^p )^{1/p}

  for p ∈ R, p ≥ 1.

• Norms, including the L^p norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f satisfying the following properties:

• f(x) = 0 ⇒ x = 0

• f(x + y) ≤ f(x) + f(y) (the triangle inequality)

• ∀α ∈ R, f(αx) = |α| f(x)

• The L^2 norm, with p = 2, is known as the Euclidean norm: the Euclidean distance from the origin to the point identified by x.
Norms

• L^p norm:

    ||x||_p = ( Σ_i |x_i|^p )^{1/p}

• Most popular norm: L^2 norm, p = 2

• L^1 norm, p = 1:

    ||x||_1 = Σ_i |x_i|                           (2.31)

• Max norm, infinite p:

    ||x||_∞ = max_i |x_i|                         (2.32)

• The L^2 norm increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations but retains mathematical simplicity: the L^1 norm. It is commonly used when this difference is very important: every time an element of x moves away from 0 by ε, the L^1 norm increases by ε.

• We sometimes measure the size of a vector by counting its number of nonzero elements. Some authors refer to this function as the “L^0 norm,” but this is incorrect terminology: the count is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L^1 norm is often used as a substitute for the number of nonzero entries.

• Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm:

    ||A||_F = ( Σ_{i,j} A_{i,j}^2 )^{1/2}
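The norms above are all available through a single NumPy routine; a hedged sketch with a vector chosen so each value is easy to verify by hand:

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)                 # L2 (Euclidean) norm, p = 2
l1 = np.linalg.norm(x, ord=1)          # L1 norm: sum of |x_i|
linf = np.linalg.norm(x, ord=np.inf)   # max norm: largest |x_i|

A = np.array([[1.0, 2.0],
              [2.0, 0.0]])
frob = np.linalg.norm(A)               # Frobenius norm of a matrix
```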
Special Matrices and Vectors

• Unit vector: a vector with unit norm:

    ||x||_2 = 1                                   (2.36)

• A vector x and a vector y are orthogonal to each other if x^T y = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. Vectors that are not only orthogonal but also have unit norm are called orthonormal. At most n vectors in R^n may be mutually orthogonal with nonzero norm.

• Diagonal matrices need not be square. Non-square diagonal matrices do not have inverses, but it is still possible to multiply by them cheaply: Dx scales each element of x, and we either concatenate some zeros to the result if D is taller than it is wide, or discard some of the last elements of the vector if D is wider than it is tall.

• Symmetric matrix: any matrix that is equal to its own transpose:

    A = A^T                                       (2.35)

  Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, in distance measurements, with A_{i,j} giving the distance from point i to point j, we have A_{i,j} = A_{j,i} because distance functions are symmetric.

• Orthogonal matrix: a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

    A^T A = AA^T = I                              (2.37)

  which implies

    A^{-1} = A^T                                  (2.38)
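A hedged sketch of Eqs. 2.37–2.38, using a 2-D rotation matrix as a standard example of an orthogonal matrix (the angle is arbitrary):

```python
import numpy as np

theta = 0.3
# A 2-D rotation matrix is a classic orthogonal matrix.
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Q^T Q = Q Q^T = I  (Eq. 2.37) ...
gram = Q.T @ Q

# ... so the inverse is simply the transpose (Eq. 2.38).
Q_inv = np.linalg.inv(Q)
```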
Eigendecomposition

• Eigenvector and eigenvalue: an eigenvector of a square matrix A is a non-zero vector v such that multiplication by A only changes the scale of v:

    Av = λv                                       (2.39)

  where λ is the eigenvalue corresponding to this eigenvector. (If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ R, s ≠ 0, with the same eigenvalue; for this reason, we usually look only for unit eigenvectors.)

• Eigendecomposition of a diagonalizable matrix: we can form a matrix V with one eigenvector per column, V = [v^(1), …, v^(n)], and concatenate the eigenvalues into a vector λ = [λ_1, …, λ_n]. The eigendecomposition of A is then given by

    A = V diag(λ) V^{-1}                          (2.40)

  Not every matrix has such a decomposition, and the decomposition may involve complex rather than real numbers.

• Every real symmetric matrix has a real-valued, orthogonal eigendecomposition:

    A = QΛQ^T                                     (2.41)

  where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix whose eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q.
Effect of Eigenvalues

[Figure: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v^(1) with eigenvalue λ_1 and v^(2) with eigenvalue λ_2. (Left) Before multiplication: the set of all unit vectors u ∈ R^2, plotted as a unit circle in the (x_1, x_2) plane. (Right) After multiplication: the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v^(i) by λ_i.]
Singular Value Decomposition

• Similar to eigendecomposition

• More general; matrix need not be square. The SVD is more broadly applicable: every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.

• Instead of A = V diag(λ) V^{-1}, we write A as a product of three matrices:

    A = UDV^T                                     (2.43)

  Suppose that A is an m × n matrix. Then U is defined to be an m × m matrix, D an m × n matrix, and V an n × n matrix.
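A hedged sketch of Eq. 2.43 on a random non-square matrix; NumPy's `svd` returns the singular values as a vector, which is then embedded into the m × n matrix D:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # A need not be square

U, s, Vt = np.linalg.svd(A)       # s holds the singular values; Vt is V^T

# Embed s into the m x n matrix D of Eq. 2.43.
D = np.zeros((4, 3))
np.fill_diagonal(D, s)

reconstructed = U @ D @ Vt        # A = U D V^T
```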
Moore-Penrose Pseudoinverse

• The pseudoinverse allows us to make some headway when A is not invertible, giving the solution

    x = A^+ y

• If the equation has:

• Exactly one solution: this is the same as the inverse.

• No solution (possible when A has more rows than columns): using the pseudoinverse gives us the x for which Ax is as close as possible to y in terms of Euclidean norm ||Ax − y||_2.

• Many solutions (possible when A has more columns than rows): using the pseudoinverse gives us the solution x = A^+ y with minimal Euclidean norm ||x||_2 among all possible solutions.
Computing the Pseudoinverse

• The pseudoinverse of A is defined as the matrix

    A^+ = lim_{α→0+} (A^T A + αI)^{-1} A^T        (2.46)

• Practical algorithms for computing the pseudoinverse are not based on this definition, but rather on the formula

    A^+ = VD^+ U^T                                (2.47)

  where U, D and V are the singular value decomposition of A, and the pseudoinverse D^+ of a diagonal matrix D is obtained by taking the reciprocal of its non-zero elements, then taking the transpose of the resulting matrix.
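A hedged sketch of Eq. 2.47, building A⁺ from the SVD and comparing it against NumPy's built-in routine (the random matrix here is full row rank, so all singular values are non-zero; a robust implementation would threshold small singular values):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))   # more columns than rows

# Per Eq. 2.47: take the reciprocal of the non-zero singular values,
# then transpose the diagonal matrix (here built directly as 5 x 3).
U, s, Vt = np.linalg.svd(A)
D_plus = np.zeros((5, 3))
np.fill_diagonal(D_plus, 1.0 / s)  # assumes all singular values are non-zero
A_plus = Vt.T @ D_plus @ U.T       # A+ = V D+ U^T

# With more columns than rows, x = A+ y is the minimum-norm solution of Ax = y.
y = rng.standard_normal(3)
x = A_plus @ y
```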
Trace

• The trace operator gives the sum of all of the diagonal entries of a matrix:

    Tr(A) = Σ_i A_{i,i}                           (2.48)

• It lets us rewrite some expressions using many useful identities. For example, the trace operator is invariant to the transpose operator:

    Tr(A) = Tr(A^T)                               (2.50)

• The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

    Tr(ABC) = Tr(CAB) = Tr(BCA)                   (2.51)

    Tr( Π_{i=1}^{n} F^(i) ) = Tr( F^(n) Π_{i=1}^{n−1} F^(i) )   (2.52)

• This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ R^{m×n} and B ∈ R^{n×m}, we have Tr(AB) = Tr(BA).
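The cyclic-permutation property can be checked numerically even when the two products have different shapes, as this hedged sketch shows (shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Invariance under transpose (Eq. 2.50).
M = rng.standard_normal((4, 4))
t = np.trace(M)
t_T = np.trace(M.T)

# Cyclic permutation holds even when shapes change: Tr(AB) = Tr(BA),
# although AB is 2 x 2 while BA is 5 x 5.
A = rng.standard_normal((2, 5))
B = rng.standard_normal((5, 2))
t_AB = np.trace(A @ B)
t_BA = np.trace(B @ A)
```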
Learning linear algebra

• Do a lot of practice problems

• Start out with lots of summation signs and indexing into individual entries

• Eventually you will be able to mostly use matrix and vector product notation quickly and easily
