
Linear Dimensionality Reduction:

Principal Component Analysis

Piyush Rai

Machine Learning (CS771A)

Sept 2, 2016





Dimensionality Reduction

Usually considered an unsupervised learning method

Used for learning the low-dimensional structures in the data

Also useful for “feature learning” or “representation learning” (learning a better, often smaller-dimensional, representation of the data), e.g.,
  Documents represented using topic vectors instead of bag-of-words vectors
  Images represented using their constituent parts (e.g., faces via eigenfaces)

Can be used for speeding up learning algorithms

Can be used for data compression

Curse of Dimensionality

Exponentially many examples are required to “fill up” high-dimensional spaces

Fewer dimensions ⇒ lower chance of overfitting ⇒ better generalization

Dimensionality reduction is a way to beat the curse of dimensionality


Linear Dimensionality Reduction

A projection matrix U = [u_1 u_2 ... u_K] of size D × K defines K linear projection directions, each u_k ∈ R^D, for the D-dimensional data (assume K < D)

Can use U to transform x_n ∈ R^D into z_n ∈ R^K

Note that z_n = U^T x_n = [u_1^T x_n  u_2^T x_n  ...  u_K^T x_n]^T is a K-dimensional projection of x_n

z_n ∈ R^K is also called the low-dimensional “embedding” of x_n ∈ R^D


Linear Dimensionality Reduction

X = [x_1 x_2 ... x_N] is the D × N matrix containing all the N data points

Z = [z_1 z_2 ... z_N] is the K × N matrix containing the embeddings of the data points

With this notation, the projection of the whole data set can be written compactly as Z = U^T X

How do we learn the “best” projection matrix U?

What criteria should we optimize for when learning U?

Principal Component Analysis (PCA) is an algorithm for doing this

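To make the shapes concrete, here is a minimal NumPy sketch (not part of the original slides; the data and the projection directions are random placeholders, not yet the PCA solution) of the transformation Z = U^T X:

```python
# Illustrative only: project D-dimensional data onto K arbitrary orthonormal directions.
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 10, 3, 100
X = rng.normal(size=(D, N))                    # D x N data matrix (columns are points)
U, _ = np.linalg.qr(rng.normal(size=(D, K)))   # D x K matrix of orthonormal directions

Z = U.T @ X                                    # K x N matrix of embeddings, z_n = U^T x_n
print(Z.shape)                                 # (3, 100)
```
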


Principal Component Analysis (PCA)

A classic linear dimensionality reduction method (Pearson, 1901; Hotelling, 1933)

Can be seen as
  Learning projection directions that capture the maximum variance in the data
  Learning projection directions that result in the smallest reconstruction error

Can also be seen as changing the basis in which the data is represented (and transforming the features such that the new features become decorrelated)

Also related to other classic methods, e.g., Factor Analysis (Spearman, 1904)

PCA as Maximizing Variance



Variance Captured by Projections

Consider projecting x_n ∈ R^D onto a one-dimensional subspace defined by u_1 ∈ R^D

The projection/embedding of x_n along this subspace is u_1^T x_n (the location of the green point along the purple line representing u_1)

Mean of the projections of all the data:
  (1/N) Σ_{n=1}^N u_1^T x_n = u_1^T ((1/N) Σ_{n=1}^N x_n) = u_1^T μ

Variance of the projected data (the “spread” of the green points):
  (1/N) Σ_{n=1}^N (u_1^T x_n − u_1^T μ)^2 = (1/N) Σ_{n=1}^N {u_1^T (x_n − μ)}^2 = u_1^T S u_1

S is the D × D data covariance matrix: S = (1/N) Σ_{n=1}^N (x_n − μ)(x_n − μ)^T. If the data is already centered (μ = 0), then S = (1/N) Σ_{n=1}^N x_n x_n^T = (1/N) X X^T

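As a quick numerical check, here is a small NumPy sketch (not part of the original slides; the data and the direction u_1 are randomly generated, purely for illustration) verifying that the variance of the one-dimensional projections equals u_1^T S u_1:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 5, 1000
X = rng.normal(size=(D, N))                 # D x N data matrix (columns are points)
mu = X.mean(axis=1, keepdims=True)          # D x 1 mean vector
S = (X - mu) @ (X - mu).T / N               # D x D covariance matrix

u1 = rng.normal(size=D)
u1 /= np.linalg.norm(u1)                    # unit-norm projection direction

proj = u1 @ X                               # projections u_1^T x_n, length N
print(np.isclose(proj.var(), u1 @ S @ u1))  # variance of projections == u_1^T S u_1: True
```
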


Direction of Maximum Variance

We want u_1 such that the variance of the projected data is maximized:

  arg max_{u_1} u_1^T S u_1

To prevent the trivial solution (max variance = infinite), assume ||u_1|| = 1, i.e., u_1^T u_1 = 1

We will find u_1 by solving the following constrained optimization problem:

  arg max_{u_1} u_1^T S u_1 + λ_1 (1 − u_1^T u_1)

where λ_1 is a Lagrange multiplier

Direction of Maximum Variance

The objective function: arg max_{u_1} u_1^T S u_1 + λ_1 (1 − u_1^T u_1)

Taking the derivative w.r.t. u_1 and setting it to zero gives

  S u_1 = λ_1 u_1

Thus u_1 is an eigenvector of S (with corresponding eigenvalue λ_1)

But which of S’s (D possible) eigenvectors is it?

Note that since u_1^T u_1 = 1, the variance of the projected data is

  u_1^T S u_1 = λ_1

The variance is therefore maximized when u_1 is the (top) eigenvector with the largest eigenvalue

The top eigenvector u_1 is also known as the first Principal Component (PC)

Other directions can also be found likewise (with each being orthogonal to all the previous ones) using the eigendecomposition of S (this is PCA)

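A small NumPy sketch of this result (not part of the original slides; the data is synthetic and purely illustrative): the top eigenvector of S captures at least as much variance as any other unit-norm direction, and the variance it captures equals its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 500
X = rng.normal(size=(D, D)) @ rng.normal(size=(D, N))   # correlated synthetic data, D x N
Xc = X - X.mean(axis=1, keepdims=True)                   # centered data
S = Xc @ Xc.T / N                                        # D x D covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric; eigenvalues in ascending order
u1, lam1 = eigvecs[:, -1], eigvals[-1]     # top eigenvector and its eigenvalue

print(np.isclose(u1 @ S @ u1, lam1))       # variance along u1 equals lambda_1: True

v = rng.normal(size=D)
v /= np.linalg.norm(v)                     # some other unit-norm direction
print(u1 @ S @ u1 >= v @ S @ v)            # True: u1 captures at least as much variance
```
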
Principal Component Analysis

Steps in Principal Component Analysis

Center the data (subtract the mean μ = (1/N) Σ_{n=1}^N x_n from each data point)

Compute the covariance matrix S using the centered data as
  S = (1/N) X X^T    (note: X assumed D × N here)

Do an eigendecomposition of the covariance matrix S

Take the first K leading eigenvectors {u_k}_{k=1}^K with eigenvalues {λ_k}_{k=1}^K

The final K-dimensional projection/embedding of the data is given by
  Z = U^T X
where U = [u_1 ... u_K] is D × K and the embedding matrix Z is K × N

A word about notation: if X is N × D, then S = (1/N) X^T X (it needs to be D × D) and the embedding will be computed as Z = X U, where Z is N × K

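The steps above translate almost line by line into code. Here is a compact NumPy sketch (illustrative, not from the original slides; it follows the slide's D × N convention for X and uses synthetic data in the usage example):

```python
import numpy as np

def pca(X, K):
    """X: D x N data matrix (columns are data points). Returns U (D x K), Z (K x N), sorted eigenvalues."""
    mu = X.mean(axis=1, keepdims=True)        # 1) center the data
    Xc = X - mu
    S = Xc @ Xc.T / X.shape[1]                # 2) D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # 3) eigendecomposition (eigenvalues ascending)
    order = np.argsort(eigvals)[::-1]         #    re-order eigenvalues from largest to smallest
    U = eigvecs[:, order[:K]]                 # 4) first K leading eigenvectors, D x K
    Z = U.T @ Xc                              # 5) K x N embeddings, Z = U^T X
    return U, Z, eigvals[order]

# Usage on synthetic 4-dimensional data
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4)) @ rng.normal(size=(4, 200))
U, Z, lam = pca(X, K=2)
print(U.shape, Z.shape)                      # (4, 2) (2, 200)
print(np.allclose(U.T @ U, np.eye(2)))       # the projection directions are orthonormal: True
```
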


PCA as Minimizing the
Reconstruction Error



Data as Combination of Basis Vectors

Assume a complete orthonormal basis of vectors u_1, u_2, ..., u_D, each u_d ∈ R^D

We can represent each data point x_n ∈ R^D exactly using this new basis:
  x_n = Σ_{k=1}^D z_{nk} u_k

Denoting z_n = [z_{n1} z_{n2} ... z_{nD}]^T, U = [u_1 u_2 ... u_D], and using U^T U = I_D,
  x_n = U z_n    and    z_n = U^T x_n

Also note that each component of the vector z_n is z_{nk} = u_k^T x_n

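A quick NumPy check of this change of basis (illustrative, not from the original slides; the orthonormal basis U is just a random rotation obtained from a QR decomposition):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 6
U, _ = np.linalg.qr(rng.normal(size=(D, D)))   # random D x D orthonormal basis, U^T U = I_D
x = rng.normal(size=D)                         # an arbitrary data point

z = U.T @ x                                    # coefficients z_k = u_k^T x
x_rebuilt = U @ z                              # x = sum_k z_k u_k
print(np.allclose(x, x_rebuilt))               # exact reconstruction with all D basis vectors: True
```
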
Reconstruction of Data from Projections

Reconstruction of x_n from z_n will be exact if we use all the D basis vectors

It will be approximate if we only use K < D basis vectors: x_n ≈ Σ_{k=1}^K z_{nk} u_k

Let’s use K = 1 basis vector. Then the one-dimensional embedding of x_n is
  z_n = u_1^T x_n    (note: this is just a scalar)

We can now try “reconstructing” x_n from its embedding z_n as follows:
  x̃_n = u_1 z_n = u_1 u_1^T x_n

Total error or “loss” in reconstructing all the data points:
  L(u_1) = Σ_{n=1}^N ||x_n − x̃_n||^2 = Σ_{n=1}^N ||x_n − u_1 u_1^T x_n||^2

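Here is a small NumPy sketch of this loss (illustrative, not from the original slides; the data is synthetic): it computes L(u_1) for a candidate direction and shows that the top eigenvector of the covariance gives a reconstruction error no larger than a random direction's.

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 4, 300
X = rng.normal(size=(D, D)) @ rng.normal(size=(D, N))
Xc = X - X.mean(axis=1, keepdims=True)           # centered data, D x N
S = Xc @ Xc.T / N

def reconstruction_loss(u1, Xc):
    """L(u1) = sum_n ||x_n - u1 u1^T x_n||^2 for unit-norm u1."""
    Xtilde = np.outer(u1, u1) @ Xc               # u1 u1^T x_n for every column of Xc
    return np.sum((Xc - Xtilde) ** 2)

u_top = np.linalg.eigh(S)[1][:, -1]              # top eigenvector of the covariance
u_rand = rng.normal(size=D)
u_rand /= np.linalg.norm(u_rand)                 # a random unit-norm direction

print(reconstruction_loss(u_top, Xc) <= reconstruction_loss(u_rand, Xc))   # True
```
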


Direction with Best Reconstruction

We want to find the u_1 that minimizes the reconstruction error

  L(u_1) = Σ_{n=1}^N ||x_n − u_1 u_1^T x_n||^2
         = Σ_{n=1}^N {x_n^T x_n + (u_1 u_1^T x_n)^T (u_1 u_1^T x_n) − 2 x_n^T u_1 u_1^T x_n}
         = − Σ_{n=1}^N u_1^T x_n x_n^T u_1    (using u_1^T u_1 = 1 and ignoring constants w.r.t. u_1)

Thus the problem is equivalent to the following maximization (dividing by N does not change the maximizer):

  arg max_{u_1 : ||u_1|| = 1} u_1^T ((1/N) Σ_{n=1}^N x_n x_n^T) u_1 = arg max_{u_1 : ||u_1|| = 1} u_1^T S u_1

where S is the covariance matrix of the data (data assumed centered)

It’s the same objective that we had when we maximized the variance

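A numerical sanity check of this equivalence (illustrative, not from the original slides): for centered data and a unit-norm u_1, L(u_1) = Σ_n ||x_n||^2 − N u_1^T S u_1, so minimizing the reconstruction error is the same as maximizing u_1^T S u_1.

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 5, 200
Xc = rng.normal(size=(D, N))
Xc -= Xc.mean(axis=1, keepdims=True)            # center the data
S = Xc @ Xc.T / N

u1 = rng.normal(size=D)
u1 /= np.linalg.norm(u1)                        # unit-norm direction

loss = np.sum((Xc - np.outer(u1, u1) @ Xc) ** 2)
rhs = np.sum(Xc ** 2) - N * (u1 @ S @ u1)       # sum_n ||x_n||^2 - N u1^T S u1
print(np.isclose(loss, rhs))                    # True
```
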


How many Principal Components to Use?

The eigenvalue λ_k measures the variance captured by the corresponding PC u_k

The “left-over” variance after keeping the first K PCs is therefore
  Σ_{k=K+1}^D λ_k

Can choose K by looking at what fraction of the total variance is captured by the first K PCs

Another direct way is to look at the spectrum of the eigenvalues, or at the plot of reconstruction error vs. the number of PCs

Can also use other criteria such as AIC/BIC (or more advanced probabilistic approaches to PCA using nonparametric Bayesian methods)

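A minimal NumPy sketch of the variance-fraction criterion (illustrative, not from the original slides; the 90% threshold and the synthetic data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 10)) @ rng.normal(size=(10, 500))   # 10-dim correlated data, D x N
Xc = X - X.mean(axis=1, keepdims=True)
S = Xc @ Xc.T / Xc.shape[1]

eigvals = np.linalg.eigvalsh(S)[::-1]             # eigenvalues, largest first
frac = np.cumsum(eigvals) / np.sum(eigvals)       # fraction of variance captured by the first K PCs
K = int(np.searchsorted(frac, 0.90)) + 1          # smallest K capturing at least 90% of the variance
print(K, frac[K - 1])
```
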
PCA as Matrix Factorization

Note that PCA represents each x_n as x_n = U z_n

When using only K < D components, x_n ≈ U z_n

For all the N data points, we can write the same as
  X ≈ U Z
where X is D × N, U is D × K, and Z is K × N

The above approximation is equivalent to a low-rank matrix factorization of X

Also closely related to the Singular Value Decomposition (SVD); see the next slide

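A small NumPy sketch of this factorization view (illustrative, not from the original slides; the data is synthetic and nearly rank-K, so the rank-K factorization reconstructs it closely):

```python
import numpy as np

rng = np.random.default_rng(6)
D, N, K = 6, 400, 2
X = rng.normal(size=(D, K)) @ rng.normal(size=(K, N)) + 0.01 * rng.normal(size=(D, N))
Xc = X - X.mean(axis=1, keepdims=True)

S = Xc @ Xc.T / N
U = np.linalg.eigh(S)[1][:, ::-1][:, :K]      # top-K eigenvectors of the covariance, D x K
Z = U.T @ Xc                                  # K x N embeddings
X_hat = U @ Z                                 # rank-K approximation of the (centered) data

print(np.linalg.matrix_rank(X_hat))                       # K (= 2)
print(np.linalg.norm(Xc - X_hat) / np.linalg.norm(Xc))    # small relative reconstruction error
```
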


PCA and SVD

A rank-K SVD approximates a data matrix X as X ≈ U Λ V^T

  U is a D × K matrix with the top K left singular vectors of X
  Λ is a K × K diagonal matrix (with the top K singular values)
  V is an N × K matrix with the top K right singular vectors of X

Rank-K SVD is based on minimizing the reconstruction error
  ||X − U Λ V^T||

PCA is equivalent to the best rank-K SVD after centering the data

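An illustrative NumPy check of this equivalence (not from the original slides; the data is synthetic): on centered data, the top-K left singular vectors match the top-K eigenvectors of the covariance up to sign, and λ_k = σ_k^2 / N relates the eigenvalues to the singular values.

```python
import numpy as np

rng = np.random.default_rng(7)
D, N, K = 5, 300, 2
X = rng.normal(size=(D, D)) @ rng.normal(size=(D, N))
Xc = X - X.mean(axis=1, keepdims=True)           # centered data

# PCA via eigendecomposition of the covariance matrix
S = Xc @ Xc.T / N
eigvals, eigvecs = np.linalg.eigh(S)
U_pca = eigvecs[:, ::-1][:, :K]                  # top-K eigenvectors

# PCA via SVD of the centered data matrix
U_svd, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
U_svd = U_svd[:, :K]                             # top-K left singular vectors

print(np.allclose(np.abs(U_pca.T @ U_svd), np.eye(K)))    # directions agree up to sign: True
print(np.allclose(eigvals[::-1][:K], sing[:K] ** 2 / N))  # lambda_k = sigma_k^2 / N: True
```
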
PCA: Some Comments

The idea of approximating each data point as a combination of basis vectors,
  x_n ≈ Σ_{k=1}^K z_{nk} u_k    or    X ≈ U Z,
is also popularly known as “Dictionary Learning” in signal/image processing; the learned basis vectors represent the “dictionary”

Some examples:
  Each face in a collection as a combination of a small number of “eigenfaces”
  Each document in a collection as a combination of a small number of “topics”
  Each gene-expression sample as a combination of a small number of “genetic pathways”

The “eigenfaces”, “topics”, “genetic pathways”, etc. are the “basis vectors”, which can be learned from data using PCA/SVD or other similar methods



PCA: Limitations and Extensions

A linear projection method

Won’t work well if the data can’t be approximated by a linear subspace

But PCA can be kernelized easily (Kernel PCA)

Variance-based projection directions can sometimes be suboptimal (e.g., if we want to preserve class separation, as when doing classification)

PCA relies on the eigendecomposition of a D × D covariance matrix
  Can be slow if done naïvely: takes O(D^3) time
  Many faster methods exist (e.g., the Power Method)

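A minimal sketch of the power method mentioned above (illustrative, not from the original slides): repeatedly applying S and renormalizing converges to the top eigenvector when the largest eigenvalue is well separated from the rest.

```python
import numpy as np

def power_method(S, num_iters=200, seed=0):
    """Approximate the top eigenvector/eigenvalue of a symmetric PSD matrix S."""
    u = np.random.default_rng(seed).normal(size=S.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(num_iters):
        u = S @ u                    # multiply by S ...
        u /= np.linalg.norm(u)       # ... and renormalize
    return u, u @ S @ u

rng = np.random.default_rng(8)
Xc = rng.normal(size=(5, 5)) @ rng.normal(size=(5, 400))
Xc -= Xc.mean(axis=1, keepdims=True)
S = Xc @ Xc.T / Xc.shape[1]

u_pm, lam_pm = power_method(S)
u_eig = np.linalg.eigh(S)[1][:, -1]
print(np.isclose(abs(u_pm @ u_eig), 1.0))    # agrees with the top eigenvector up to sign: True
```
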
