
Chapter 10. Dimensionality Reduction with PCA

Note: Equivalent to 1 slot. Lecturers can use extra materials for other slots in the chapter.
Chapter 10. Dimensionality Reduction with PCA

10.1 Problem Setting
10.2 Maximum Variance Perspective
10.3 Projection Perspective
10.4 Eigenvector Computation and Low-Rank Approximations
10.5 PCA in High Dimensions
10.6 Key Steps of PCA in Practice
10.1 Problem Setting
Consider a data set X = {x_1, . . . , x_N}, x_n ∈ R^D, with mean 0 and the data
covariance matrix

    S = (1/N) ∑_{n=1}^{N} x_n x_n^T.                                        (1)

Assume that there exists a low-dimensional compressed representation (code) of x_n,

    z_n = B^T x_n ∈ R^M,                                                    (2)

where the projection matrix

    B := [b_1, . . . , b_M] ∈ R^{D×M}                                       (3)

has orthonormal columns.

We seek an M-dimensional subspace U of R^D onto which we project the data.
Denote the projected data by x̃_n ∈ U and their coordinates by z_n.
Our aim is to find projections x̃_n that are as similar to the original data
x_n as possible, i.e., that minimize the loss due to compression.
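
To make the setting concrete, here is a minimal numpy sketch (the variable names X, B, Z and the random data are illustrative assumptions, not part of the text): center the data, form the covariance matrix S of (1), and compute codes z_n = B^T x_n as in (2) for some matrix B with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 5, 100, 2                      # ambient dimension, number of points, code dimension

X = rng.normal(size=(D, N))              # data matrix, one column per data point x_n
X = X - X.mean(axis=1, keepdims=True)    # center the data so that the mean is 0

S = (X @ X.T) / N                        # data covariance matrix, eq. (1)

# Any matrix with orthonormal columns can serve as a projection matrix B;
# here we take a random one via a QR decomposition (an illustrative choice).
B, _ = np.linalg.qr(rng.normal(size=(D, M)))

Z = B.T @ X                              # codes z_n = B^T x_n, eq. (2)
X_tilde = B @ Z                          # projected data x~_n = B z_n
print(Z.shape, X_tilde.shape)            # (2, 100), (5, 100)
```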
10.2 Maximum Variance Perspective
Direction with Maximal Variance:
We seek a single vector b_1 ∈ R^D that maximizes the variance of the projected
data, that is, we maximize the variance of the first coordinate z_1 of z ∈ R^M:

    V_1 := V[z_1] = (1/N) ∑_{n=1}^{N} z_{1n}^2 = (1/N) ∑_{n=1}^{N} (b_1^T x_n)^2
         = (1/N) ∑_{n=1}^{N} b_1^T x_n x_n^T b_1
         = b_1^T ( (1/N) ∑_{n=1}^{N} x_n x_n^T ) b_1 = b_1^T S b_1.          (4)
This leads to the constrained optimization problem

    max_{b_1}  b_1^T S b_1                                                   (5)
    subject to ∥b_1∥^2 = 1,                                                  (6)

which has the Lagrangian

    L(b_1, λ_1) = b_1^T S b_1 + λ_1 (1 − b_1^T b_1).                         (7)
Setting the partial derivatives of L to 0,

    ∂L/∂b_1 = 2 b_1^T S − 2 λ_1 b_1^T = 0,                                   (8)
    ∂L/∂λ_1 = 1 − b_1^T b_1 = 0,                                             (9)
we obtain

    S b_1 = λ_1 b_1,   b_1^T b_1 = 1.                                        (10)

Hence b_1 is an eigenvector of the data covariance matrix S corresponding to
the eigenvalue λ_1, and

    V_1 = b_1^T S b_1 = λ_1.                                                 (11)

To maximize V_1 we therefore choose b_1 as the eigenvector of S associated
with its largest eigenvalue λ_1; this eigenvector is called the first
principal component.
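
As a quick numerical check (a numpy sketch with illustrative data and names), the first principal component can be read off an eigendecomposition of S, and the variance of the projected data equals the largest eigenvalue, as in (11).

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 4, 500
X = rng.normal(size=(D, N)) * np.array([[3.0], [2.0], [1.0], [0.5]])
X = X - X.mean(axis=1, keepdims=True)    # centered data
S = (X @ X.T) / N                        # data covariance matrix

lam, U = np.linalg.eigh(S)               # ascending eigenvalues, orthonormal eigenvectors
b1, lam1 = U[:, -1], lam[-1]             # eigenvector of the largest eigenvalue

z1 = b1 @ X                              # projected first coordinates z_{1n}
print(np.isclose(z1 @ z1 / N, lam1))     # V_1 = b_1^T S b_1 = λ_1, eq. (11) -> True
```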



M-dimensional Subspace

Assume that we have found the first m − 1 principal components b_1, . . . , b_{m−1}
as the m − 1 eigenvectors of S that are associated with the m − 1 largest
eigenvalues λ_1, . . . , λ_{m−1}. Since S is symmetric, these eigenvectors form an
orthonormal basis of an (m − 1)-dimensional subspace of R^D. We now look for the
principal component that compresses the remaining information.

Set X := [x_1, . . . , x_N] ∈ R^{D×N} and let B_{m−1} := ∑_{i=1}^{m−1} b_i b_i^T be
the projection matrix that projects onto the subspace spanned by b_1, . . . , b_{m−1}.
The new data matrix is

    X̂ := X − ∑_{i=1}^{m−1} b_i b_i^T X = X − B_{m−1} X.                      (12)



M-dimensional Subspace

Let Ŝ be the data covariance matrix of the transformed data set
X̂ := {x̂_1, . . . , x̂_N}.

To find the mth principal component, we solve the constrained optimization
problem

    max_{b_m}  V_m := V[z_m] = (1/N) ∑_{n=1}^{N} z_{mn}^2
                             = (1/N) ∑_{n=1}^{N} (b_m^T x̂_n)^2 = b_m^T Ŝ b_m   (13)
    subject to ∥b_m∥^2 = 1.

The optimal solution b_m is the eigenvector of Ŝ that is associated with the
largest eigenvalue of Ŝ.
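
A small sketch of this deflation view (numpy, with assumed names such as pca_by_deflation): repeatedly remove the subspace spanned by the components found so far, as in (12), and take the top eigenvector of the deflated covariance. This recovers the same components as a single eigendecomposition of S, which the check at the end confirms numerically.

```python
import numpy as np

def pca_by_deflation(X, M):
    """Find M principal components by repeatedly deflating the data matrix,
    following eqs. (12)-(13): X_hat = X - B_{m-1} X, then take the top
    eigenvector of the deflated covariance."""
    D, N = X.shape
    B = np.zeros((D, 0))
    for _ in range(M):
        X_hat = X - B @ (B.T @ X)            # deflate: remove span(b_1, ..., b_{m-1})
        S_hat = (X_hat @ X_hat.T) / N
        lam, U = np.linalg.eigh(S_hat)       # ascending eigenvalues
        B = np.column_stack([B, U[:, -1]])   # append eigenvector of the largest eigenvalue
    return B

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 200)) * np.arange(5, 0, -1)[:, None]
X = X - X.mean(axis=1, keepdims=True)

B = pca_by_deflation(X, M=2)
lam, U = np.linalg.eigh((X @ X.T) / X.shape[1])
# Columns agree with the top-2 eigenvectors of S up to sign.
print(np.allclose(np.abs(B.T @ U[:, :-3:-1]), np.eye(2), atol=1e-6))
```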



M-dimensional Subspace

We shall show that the sets of eigenvectors of S and Ŝ are identical.
Consider an eigenvector b_i of S, i.e., S b_i = λ_i b_i. We have

    Ŝ b_i = (1/N) X̂ X̂^T b_i = (1/N)(X − B_{m−1} X)(X − B_{m−1} X)^T b_i
          = (S − S B_{m−1} − B_{m−1} S + B_{m−1} S B_{m−1}) b_i.             (14)

If i ≥ m, then b_i is orthogonal to the first m − 1 principal components and
B_{m−1} b_i = 0. Hence

    Ŝ b_i = (S − B_{m−1} S) b_i = S b_i = λ_i b_i,                           (15)

which shows that b_i is also an eigenvector of Ŝ with eigenvalue λ_i. In
particular, λ_m is the largest eigenvalue of Ŝ and the mth largest eigenvalue
of S, and both have the associated eigenvector b_m.



M-dimensional Subspace

If i < m, then b_i is a basis vector of the principal subspace onto which
B_{m−1} projects. Since b_1, . . . , b_{m−1} are an ONB of this principal
subspace, we obtain B_{m−1} b_i = b_i. Thus,

    Ŝ b_i = (S − S B_{m−1} − B_{m−1} S + B_{m−1} S B_{m−1}) b_i = 0 = 0 · b_i.   (16)

This means that b_1, . . . , b_{m−1} are also eigenvectors of Ŝ, but they are
associated with eigenvalue 0, so that b_1, . . . , b_{m−1} span the null space
of Ŝ.



M-dimensional Subspace
By (15), the variance of the data projected onto the mth principal component is

    V_m = b_m^T S b_m = λ_m b_m^T b_m = λ_m.                                 (17)

To find an M-dimensional subspace of R^D that retains as much information as
possible, PCA tells us to choose the columns of the matrix B in (3) as the M
eigenvectors of the data covariance matrix S that are associated with the M
largest eigenvalues. The maximum amount of variance PCA can capture with the
first M principal components is

    V_M = ∑_{m=1}^{M} λ_m,                                                   (18)

where the λ_m are the M largest eigenvalues of the data covariance matrix S.
Consequently, the variance lost by data compression via PCA is

    J_M := ∑_{j=M+1}^{D} λ_j = V_D − V_M.                                    (19)
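
A short numpy sketch of (18) and (19) (illustrative data and names): sort the eigenvalues of S, then V_M and J_M follow from cumulative sums; this is also a common way to pick M so that, say, 95% of the variance is retained.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 300)) * np.array([[4.0], [3.0], [2.0], [1.0], [0.5], [0.1]])
X = X - X.mean(axis=1, keepdims=True)
S = (X @ X.T) / X.shape[1]

lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues in descending order
V = np.cumsum(lam)                           # V_M for M = 1, ..., D, eq. (18)
J = V[-1] - V                                # lost variance J_M = V_D - V_M, eq. (19)

ratio = V / V[-1]                            # fraction of variance captured
M = int(np.searchsorted(ratio, 0.95) + 1)    # smallest M capturing at least 95%
print(M, ratio.round(3))
```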



10.3 Projection Perspective

Setting and Objective

We will derive PCA as an algorithm that directly minimizes the average
reconstruction error. Assume that (b_1, . . . , b_D) is an orthonormal basis of
R^D. Any x ∈ R^D can be written as

    x = ∑_{d=1}^{D} ξ_d b_d = ∑_{m=1}^{M} ξ_m b_m + ∑_{j=M+1}^{D} ξ_j b_j,   ξ_d ∈ R.   (20)

We look for a vector x̃ in an M-dimensional subspace U of R^D,

    x̃ = ∑_{m=1}^{M} z_m b_m,                                                 (21)

that is as similar to x as possible.



Without loss of generality, assume that the dataset X = {x_1, . . . , x_N} is
centered at 0, i.e., E[X] = 0. We are interested in finding the best linear
projection of X onto the principal subspace U with ONB b_1, . . . , b_M. The
projections of the data points are

    x̃_n := ∑_{m=1}^{M} z_{mn} b_m = B z_n ∈ R^D,                             (22)

and the reconstruction error is

    J_M := (1/N) ∑_{n=1}^{N} ∥x_n − x̃_n∥^2.                                  (23)

To find the coordinates z_n = [z_{1n}, . . . , z_{Mn}]^T ∈ R^M of x̃_n with
respect to the basis (b_1, . . . , b_M) of the principal subspace, we proceed
in two steps: first, we optimize the coordinates z_n for a given ONB
(b_1, . . . , b_M); second, we find the optimal ONB.



Finding Optimal Coordinates
We have

    ∂J_M/∂z_{in} = (∂J_M/∂x̃_n)(∂x̃_n/∂z_{in}),
    ∂J_M/∂x̃_n = −(2/N)(x_n − x̃_n)^T ∈ R^{1×D},                              (24)
    ∂x̃_n/∂z_{in} = ∂/∂z_{in} ( ∑_{m=1}^{M} z_{mn} b_m ) = b_i.

Hence

    ∂J_M/∂z_{in} = −(2/N)(x_n − x̃_n)^T b_i
                 = −(2/N)( x_n − ∑_{m=1}^{M} z_{mn} b_m )^T b_i              (25)
                 = −(2/N)( x_n^T b_i − z_{in} b_i^T b_i ) = −(2/N)( x_n^T b_i − z_{in} ).   (26)

Setting this to 0, we obtain

    z_{in} = x_n^T b_i = b_i^T x_n.                                           (27)
Finding the Basis of the Principal Subspace

To determine the basis vectors b_1, . . . , b_M of the principal subspace, we
rephrase the loss function (23) using the results we have so far. Using (27),

    x̃_n = ∑_{m=1}^{M} z_{mn} b_m = ∑_{m=1}^{M} (x_n^T b_m) b_m
        = ( ∑_{m=1}^{M} b_m b_m^T ) x_n.                                      (28)

Similarly, expanding x_n in the full basis (b_1, . . . , b_D) with coordinates
z_{dn} = x_n^T b_d gives

    x_n = ∑_{d=1}^{D} z_{dn} b_d = ∑_{d=1}^{D} (x_n^T b_d) b_d = ( ∑_{d=1}^{D} b_d b_d^T ) x_n
        = ( ∑_{m=1}^{M} b_m b_m^T ) x_n + ( ∑_{j=M+1}^{D} b_j b_j^T ) x_n.    (29)

Hence the difference vector between the original data point and its projection is

    x_n − x̃_n = ( ∑_{j=M+1}^{D} b_j b_j^T ) x_n = ∑_{j=M+1}^{D} (x_n^T b_j) b_j.   (30)



Reformulating the loss function, using (30):

    J_M = (1/N) ∑_{n=1}^{N} ∥x_n − x̃_n∥^2
        = (1/N) ∑_{n=1}^{N} ∥ ∑_{j=M+1}^{D} (b_j^T x_n) b_j ∥^2              (31)
        = (1/N) ∑_{n=1}^{N} ∑_{j=M+1}^{D} (b_j^T x_n)^2                      (32)
        = (1/N) ∑_{n=1}^{N} ∑_{j=M+1}^{D} b_j^T x_n x_n^T b_j
        = ∑_{j=M+1}^{D} b_j^T ( (1/N) ∑_{n=1}^{N} x_n x_n^T ) b_j = ∑_{j=M+1}^{D} b_j^T S b_j   (33)
        = ∑_{j=M+1}^{D} tr(b_j^T S b_j) = ∑_{j=M+1}^{D} tr(S b_j b_j^T)
        = tr( ( ∑_{j=M+1}^{D} b_j b_j^T ) S ).                               (34)



Hence the following are equivalent:

Minimize the average squared reconstruction error.
Minimize the variance of the data when projected onto the orthogonal
complement of the principal subspace.
Maximize the variance of the projection that we retain in the principal
subspace, which links the projection loss immediately to the maximum-variance
formulation of PCA.

The average squared reconstruction error when projecting onto the
M-dimensional principal subspace is

    J_M = ∑_{j=M+1}^{D} λ_j,                                                 (35)

where the λ_j are the eigenvalues of the data covariance matrix. Therefore, to
minimize (35) we need to select the smallest D − M eigenvalues, which then
implies that their corresponding eigenvectors are the basis of the orthogonal
complement of the principal subspace. Consequently, the basis of the principal
subspace comprises the eigenvectors b_1, . . . , b_M that are associated with
the M largest eigenvalues of the data covariance matrix.
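
The following numpy sketch (illustrative data; B and X_tilde are assumed names) verifies this numerically: projecting onto the top-M eigenvectors gives an average squared reconstruction error equal to the sum of the discarded eigenvalues, as in (35).

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, M = 6, 400, 2
X = rng.normal(size=(D, N)) * np.linspace(3.0, 0.2, D)[:, None]
X = X - X.mean(axis=1, keepdims=True)
S = (X @ X.T) / N

lam, U = np.linalg.eigh(S)               # ascending eigenvalues
B = U[:, -M:]                            # top-M eigenvectors: basis of the principal subspace
X_tilde = B @ (B.T @ X)                  # projections onto the principal subspace

J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=0))   # average squared reconstruction error, eq. (23)
print(np.isclose(J_M, lam[:-M].sum()))   # equals the sum of the D-M smallest eigenvalues, eq. (35)
```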
10.4 Eigenvector Computation and Low-Rank
Approximations
The SVD of the data matrix X = [x_1, ⋯ , x_N] ∈ R^{D×N} is given by

    X = U Σ V^T,                                                             (36)

where U ∈ R^{D×D} and V ∈ R^{N×N} are orthogonal matrices and Σ ∈ R^{D×N} is a
matrix whose only nonzero entries are the singular values σ_ii ≥ 0. The data
covariance matrix is

    S = (1/N) X X^T = (1/N) U Σ V^T V Σ^T U^T = (1/N) U Σ Σ^T U^T.            (37)

The columns of U are the eigenvectors of X X^T (see Section 4.5). Furthermore,
the eigenvalues λ_d of S are related to the singular values of X via

    λ_d = σ_d^2 / N.                                                          (38)

This relation provides the connection between the maximum variance view and
the singular value decomposition.
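
A brief numerical check of (38) (numpy sketch, illustrative data): the eigenvalues of S coincide with the squared singular values of the centered data matrix divided by N.

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 4, 250
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Σ V^T, eq. (36)
S = (X @ X.T) / N
lam = np.sort(np.linalg.eigvalsh(S))[::-1]             # eigenvalues of S, descending

print(np.allclose(lam, sigma**2 / N))                  # λ_d = σ_d^2 / N, eq. (38) -> True
```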
PCA Using Low-Rank Matrix Approximations

To maximize the variance of the projected data, PCA chooses the columns of U
in (37) that are associated with the M largest eigenvalues of the data
covariance matrix S, and we identify this D × M submatrix of U with the
projection matrix B in (3), which projects the original data onto a
lower-dimensional subspace of dimension M. The Eckart-Young theorem
(Section 4.6) offers a direct way to estimate the low-dimensional
representation. Consider the best rank-M approximation of X,

    X̃_M := argmin_{rk(A)≤M} ∥X − A∥_2 ∈ R^{D×N},                             (39)

where ∥⋅∥_2 is the spectral norm defined in Section 4.6.
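
A sketch of this connection in numpy (illustrative names): the best rank-M approximation obtained by truncating the SVD coincides with the PCA reconstruction B B^T X when B consists of the first M columns of U.

```python
import numpy as np

rng = np.random.default_rng(6)
D, N, M = 5, 100, 2
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-M approximation (Eckart-Young): keep only the M largest singular values.
X_M = U[:, :M] @ np.diag(sigma[:M]) @ Vt[:M, :]

# PCA reconstruction with B = first M columns of U gives the same matrix.
B = U[:, :M]
print(np.allclose(X_M, B @ (B.T @ X)))   # -> True
```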



Practical Aspects
For large matrices, it is not possible to compute the full set of eigenvalues
and eigenvectors. In practice, we solve for eigenvalues or singular values
using iterative methods, which are implemented in all modern packages for
linear algebra.

In PCA, we only require a few eigenvectors. If we are interested in only the
first few eigenvectors (with the largest eigenvalues), then iterative methods,
which directly optimize these eigenvectors, are computationally more efficient
than a full eigendecomposition (or SVD). In the extreme case of only needing
the first eigenvector, a simple method called power iteration can be used:
choose a random vector x_0 that is not in the null space of S and follow the
iteration

    x_{k+1} = S x_k / ∥S x_k∥.                                               (40)

This sequence of vectors converges to the eigenvector associated with the
largest eigenvalue of S.
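
A minimal power-iteration sketch in numpy (the function name power_iteration and the fixed iteration count are assumptions; in practice one would iterate until convergence):

```python
import numpy as np

def power_iteration(S, num_iters=1000, seed=0):
    """Power iteration, eq. (40): repeatedly apply S and normalize.
    Converges to the eigenvector of the largest eigenvalue (assuming it is
    unique and the start vector has a component in its direction)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=S.shape[0])          # random start, almost surely not in the null space
    for _ in range(num_iters):
        x = S @ x
        x = x / np.linalg.norm(x)
    return x

rng = np.random.default_rng(7)
A = rng.normal(size=(5, 200))
S = (A @ A.T) / A.shape[1]

b1 = power_iteration(S)
lam, U = np.linalg.eigh(S)
print(np.isclose(abs(b1 @ U[:, -1]), 1.0))   # matches the top eigenvector up to sign -> True
```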
10.5 PCA in High Dimensions

We now provide a method for performing PCA when we have fewer data points than
dimensions, N ≪ D.

Assume we have a centered dataset x_1, . . . , x_N ∈ R^D with no duplicate data
points. Since the data set is centered, rk(S) ≤ N − 1, so S has at least
D − N + 1 eigenvalues that are 0. We will exploit this and turn the D × D
covariance matrix into an N × N matrix whose nonzero eigenvalues are the same
as those of S.

Let b_m, m = 1, . . . , M, be a basis vector of the principal subspace. Then

    S b_m = (1/N) X X^T b_m = λ_m b_m                                        (41)
    ⇒ (1/N) X^T X X^T b_m = λ_m X^T b_m ⇔ (1/N) X^T X c_m = λ_m c_m.          (42)



Here c_m := X^T b_m is an eigenvector of the N × N matrix (1/N) X^T X ∈ R^{N×N}
associated with λ_m (note that (1/N) X^T X and S have the same nonzero
eigenvalues). We recover the original eigenvectors by multiplying by X:

    (1/N) X X^T X c_m = λ_m X c_m,                                           (43)

which means that X c_m is an eigenvector of S (it still needs to be normalized
to unit length).
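
A numpy sketch of this high-dimensional trick (illustrative sizes and names): eigendecompose the small N × N matrix of (42), map the eigenvectors back with X as in (43), and normalize; the result satisfies S b_m = λ_m b_m.

```python
import numpy as np

rng = np.random.default_rng(8)
D, N, M = 1000, 20, 3                    # many dimensions, few data points
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)    # centered data

# Eigendecomposition of the small N x N matrix (1/N) X^T X, eq. (42).
lam, C = np.linalg.eigh((X.T @ X) / N)
lam, C = lam[::-1], C[:, ::-1]           # descending order

# Recover eigenvectors of S as X c_m, eq. (43), and normalize them.
B = X @ C[:, :M]
B = B / np.linalg.norm(B, axis=0)

S = (X @ X.T) / N                        # D x D matrix, built here only for the check
print(np.allclose(S @ B, B * lam[:M]))   # S b_m = λ_m b_m -> True
```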



10.6 Key Steps of PCA in Practice

1. Mean subtraction
2. Standardization
3. Eigendecomposition of the covariance matrix
4. Projection
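
A compact end-to-end sketch of these four steps in numpy (the function pca and its variable names are illustrative assumptions); step 4 computes the codes z_n = B^T x_n and also maps the reconstruction back to the original units.

```python
import numpy as np

def pca(X, M):
    """Key steps of PCA on a data matrix X of shape (D, N), one column per data point."""
    # 1. Mean subtraction: center the data.
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # 2. Standardization: divide each dimension by its standard deviation.
    std = Xc.std(axis=1, keepdims=True)
    Xs = Xc / std
    # 3. Eigendecomposition of the covariance matrix.
    S = (Xs @ Xs.T) / X.shape[1]
    lam, U = np.linalg.eigh(S)
    B = U[:, ::-1][:, :M]                # eigenvectors of the M largest eigenvalues
    # 4. Projection: codes z_n = B^T x_n, reconstruction mapped back to original units.
    Z = B.T @ Xs
    X_tilde = (B @ Z) * std + mu
    return Z, X_tilde, B

rng = np.random.default_rng(9)
X = rng.normal(size=(5, 50)) * np.array([[5.0], [3.0], [1.0], [0.5], [0.1]])
Z, X_tilde, B = pca(X, M=2)
print(Z.shape, X_tilde.shape)            # (2, 50), (5, 50)
```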



Summary

We have studied:

Maximum variance perspective of PCA;


PCA as an algorithm that directly minimizes the average
reconstruction error;
PCA using low-rank matrix approximations;
PCA in high dimensions.

