
Chapter 10. Dimensionality Reduction with PCA

Note: Equivalent to 1 slot. Lecturers can use extra materials for other slots in the chapter.
Chapter 10. Dimensionality Reduction with PCA

10.1 Problem Setting
10.2 Maximum Variance Perspective
10.3 Projection Perspective
10.4 Eigenvector Computation and Low-Rank Approximations
10.5 PCA in High Dimensions
10.6 Key Steps of PCA in Practice
10.1 Problem Setting
Consider a data set X = {x_1, . . . , x_N}, x_n ∈ R^D, with mean 0 and the data
covariance matrix

    S = (1/N) ∑_{n=1}^{N} x_n x_n^T.                                        (1)

Assume that there exists a low-dimensional compressed representation (code) of x_n,

    z_n = B^T x_n ∈ R^M,                                                    (2)

where the projection matrix

    B := [b_1, . . . , b_M] ∈ R^{D×M}                                       (3)

has orthonormal columns.

We seek an M-dimensional subspace U of R^D onto which we project the data.
Denote the projected data by x̃_n ∈ U and their coordinates by z_n.
Our aim is to find projections x̃_n that are as similar to the original data
x_n as possible, i.e., that minimize the loss due to compression.
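
To make the setting concrete, here is a minimal numpy sketch (the variable names X, B, Z and the random data are illustrative assumptions, not part of the text): center the data, form the covariance matrix S of (1), and compute codes z_n = B^T x_n as in (2) for some matrix B with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 5, 100, 2                      # ambient dimension, number of points, code dimension

X = rng.normal(size=(D, N))              # data matrix, one column per data point x_n
X = X - X.mean(axis=1, keepdims=True)    # center the data so that the mean is 0

S = (X @ X.T) / N                        # data covariance matrix, eq. (1)

# Any matrix with orthonormal columns can serve as a projection matrix B;
# here we take a random one via a QR decomposition (an illustrative choice).
B, _ = np.linalg.qr(rng.normal(size=(D, M)))

Z = B.T @ X                              # codes z_n = B^T x_n, eq. (2)
X_tilde = B @ Z                          # projected data x~_n = B z_n
print(Z.shape, X_tilde.shape)            # (2, 100), (5, 100)
```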
10.2 Maximum Variance Perspective
Direction with Maximal Variance:
We seek a single vector b_1 ∈ R^D that maximizes the variance of the projected
data, that is, we maximize the variance of the first coordinate z_1 of z ∈ R^M:

    V_1 := V[z_1] = (1/N) ∑_{n=1}^{N} z_{1n}^2 = (1/N) ∑_{n=1}^{N} (b_1^T x_n)^2
         = (1/N) ∑_{n=1}^{N} b_1^T x_n x_n^T b_1
         = b_1^T ( (1/N) ∑_{n=1}^{N} x_n x_n^T ) b_1 = b_1^T S b_1.          (4)
This leads to the constrained optimization problem

    max_{b_1}  b_1^T S b_1                                                   (5)
    subject to ∥b_1∥^2 = 1,                                                  (6)

which has the Lagrangian

    L(b_1, λ_1) = b_1^T S b_1 + λ_1 (1 − b_1^T b_1).                         (7)
Setting the partial derivatives of L to 0,

    ∂L/∂b_1 = 2 b_1^T S − 2 λ_1 b_1^T = 0,                                   (8)
    ∂L/∂λ_1 = 1 − b_1^T b_1 = 0,                                             (9)
we obtain

    S b_1 = λ_1 b_1,   b_1^T b_1 = 1.                                        (10)

Hence b_1 is an eigenvector of the data covariance matrix S corresponding to
the eigenvalue λ_1, and

    V_1 = b_1^T S b_1 = λ_1.                                                 (11)

To maximize V_1 we therefore choose b_1 as the eigenvector of S associated
with its largest eigenvalue λ_1; this eigenvector is called the first
principal component.
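
As a quick numerical check (a numpy sketch with illustrative data and names), the first principal component can be read off an eigendecomposition of S, and the variance of the projected data equals the largest eigenvalue, as in (11).

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 4, 500
X = rng.normal(size=(D, N)) * np.array([[3.0], [2.0], [1.0], [0.5]])
X = X - X.mean(axis=1, keepdims=True)    # centered data
S = (X @ X.T) / N                        # data covariance matrix

lam, U = np.linalg.eigh(S)               # ascending eigenvalues, orthonormal eigenvectors
b1, lam1 = U[:, -1], lam[-1]             # eigenvector of the largest eigenvalue

z1 = b1 @ X                              # projected first coordinates z_{1n}
print(np.isclose(z1 @ z1 / N, lam1))     # V_1 = b_1^T S b_1 = λ_1, eq. (11) -> True
```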



M-dimensional Subspace

Assume that we have found the first m − 1 principal components b_1, . . . , b_{m−1}
as the m − 1 eigenvectors of S that are associated with the m − 1 largest
eigenvalues λ_1, . . . , λ_{m−1}. Since S is symmetric, these eigenvectors form an
orthonormal basis of an (m − 1)-dimensional subspace of R^D. We now look for the
principal component that compresses the remaining information.

Set X := [x_1, . . . , x_N] ∈ R^{D×N} and let B_{m−1} := ∑_{i=1}^{m−1} b_i b_i^T be
the projection matrix that projects onto the subspace spanned by b_1, . . . , b_{m−1}.
The new data matrix is

    X̂ := X − ∑_{i=1}^{m−1} b_i b_i^T X = X − B_{m−1} X.                      (12)



M-dimensional Subspace

Let Ŝ be the data covariance matrix of the transformed data set
X̂ := {x̂_1, . . . , x̂_N}.

To find the mth principal component, we solve the constrained optimization
problem

    max_{b_m}  V_m := V[z_m] = (1/N) ∑_{n=1}^{N} z_{mn}^2
                             = (1/N) ∑_{n=1}^{N} (b_m^T x̂_n)^2 = b_m^T Ŝ b_m   (13)
    subject to ∥b_m∥^2 = 1.

The optimal solution b_m is the eigenvector of Ŝ that is associated with the
largest eigenvalue of Ŝ.
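
A small sketch of this deflation view (numpy, with assumed names such as pca_by_deflation): repeatedly remove the subspace spanned by the components found so far, as in (12), and take the top eigenvector of the deflated covariance. This recovers the same components as a single eigendecomposition of S, which the check at the end confirms numerically.

```python
import numpy as np

def pca_by_deflation(X, M):
    """Find M principal components by repeatedly deflating the data matrix,
    following eqs. (12)-(13): X_hat = X - B_{m-1} X, then take the top
    eigenvector of the deflated covariance."""
    D, N = X.shape
    B = np.zeros((D, 0))
    for _ in range(M):
        X_hat = X - B @ (B.T @ X)            # deflate: remove span(b_1, ..., b_{m-1})
        S_hat = (X_hat @ X_hat.T) / N
        lam, U = np.linalg.eigh(S_hat)       # ascending eigenvalues
        B = np.column_stack([B, U[:, -1]])   # append eigenvector of the largest eigenvalue
    return B

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 200)) * np.arange(5, 0, -1)[:, None]
X = X - X.mean(axis=1, keepdims=True)

B = pca_by_deflation(X, M=2)
lam, U = np.linalg.eigh((X @ X.T) / X.shape[1])
# Columns agree with the top-2 eigenvectors of S up to sign.
print(np.allclose(np.abs(B.T @ U[:, :-3:-1]), np.eye(2), atol=1e-6))
```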



M-dimensional Subspace

We shall show that the sets of eigenvectors of S and Ŝ are identical.
Consider an eigenvector b_i of S, i.e., S b_i = λ_i b_i. We have

    Ŝ b_i = (1/N) X̂ X̂^T b_i = (1/N)(X − B_{m−1} X)(X − B_{m−1} X)^T b_i
          = (S − S B_{m−1} − B_{m−1} S + B_{m−1} S B_{m−1}) b_i.             (14)

If i ≥ m, then b_i is orthogonal to the first m − 1 principal components and
B_{m−1} b_i = 0. Hence

    Ŝ b_i = (S − B_{m−1} S) b_i = S b_i = λ_i b_i,                           (15)

which shows that b_i is also an eigenvector of Ŝ with eigenvalue λ_i. In
particular, λ_m is the largest eigenvalue of Ŝ and the mth largest eigenvalue
of S, and both have the associated eigenvector b_m.



M-dimensional Subspace

If i < m, then b_i is a basis vector of the principal subspace onto which
B_{m−1} projects. Since b_1, . . . , b_{m−1} are an ONB of this principal
subspace, we obtain B_{m−1} b_i = b_i. Thus,

    Ŝ b_i = (S − S B_{m−1} − B_{m−1} S + B_{m−1} S B_{m−1}) b_i = 0 = 0 · b_i.   (16)

This means that b_1, . . . , b_{m−1} are also eigenvectors of Ŝ, but they are
associated with eigenvalue 0, so that b_1, . . . , b_{m−1} span the null space
of Ŝ.



M-dimensional Subspace
By (15), the variance of the data projected onto the mth principal component is

    V_m = b_m^T S b_m = λ_m b_m^T b_m = λ_m.                                 (17)

To find an M-dimensional subspace of R^D that retains as much information as
possible, PCA tells us to choose the columns of the matrix B in (3) as the M
eigenvectors of the data covariance matrix S that are associated with the M
largest eigenvalues. The maximum amount of variance PCA can capture with the
first M principal components is

    V_M = ∑_{m=1}^{M} λ_m,                                                   (18)

where the λ_m are the M largest eigenvalues of the data covariance matrix S.
Consequently, the variance lost by data compression via PCA is

    J_M := ∑_{j=M+1}^{D} λ_j = V_D − V_M.                                    (19)
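
A short numpy sketch of (18) and (19) (illustrative data and names): sort the eigenvalues of S, then V_M and J_M follow from cumulative sums; this is also a common way to pick M so that, say, 95% of the variance is retained.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 300)) * np.array([[4.0], [3.0], [2.0], [1.0], [0.5], [0.1]])
X = X - X.mean(axis=1, keepdims=True)
S = (X @ X.T) / X.shape[1]

lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues in descending order
V = np.cumsum(lam)                           # V_M for M = 1, ..., D, eq. (18)
J = V[-1] - V                                # lost variance J_M = V_D - V_M, eq. (19)

ratio = V / V[-1]                            # fraction of variance captured
M = int(np.searchsorted(ratio, 0.95) + 1)    # smallest M capturing at least 95%
print(M, ratio.round(3))
```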



10.3 Projection Perspective

Setting and Objective

We will derive PCA as an algorithm that directly minimizes the average
reconstruction error. Assume that (b_1, . . . , b_D) is an orthonormal basis of
R^D. Any x ∈ R^D can be written as

    x = ∑_{d=1}^{D} ξ_d b_d = ∑_{m=1}^{M} ξ_m b_m + ∑_{j=M+1}^{D} ξ_j b_j,   ξ_d ∈ R.   (20)

We look for a vector x̃ in an M-dimensional subspace U of R^D,

    x̃ = ∑_{m=1}^{M} z_m b_m,                                                 (21)

that is as similar to x as possible.



Without loss of generality, assume that the dataset X = {x_1, . . . , x_N} is
centered at 0, i.e., E[X] = 0. We are interested in finding the best linear
projection of X onto the principal subspace U with ONB b_1, . . . , b_M. The
projections of the data points are

    x̃_n := ∑_{m=1}^{M} z_{mn} b_m = B z_n ∈ R^D,                             (22)

and the reconstruction error is

    J_M := (1/N) ∑_{n=1}^{N} ∥x_n − x̃_n∥^2.                                  (23)

To find the coordinates z_n = [z_{1n}, . . . , z_{Mn}]^T ∈ R^M of x̃_n with
respect to the basis (b_1, . . . , b_M) of the principal subspace, we proceed
in two steps: first, we optimize the coordinates z_n for a given ONB
(b_1, . . . , b_M); second, we find the optimal ONB.



Finding Optimal Coordinates
We have

    ∂J_M/∂z_{in} = (∂J_M/∂x̃_n)(∂x̃_n/∂z_{in}),
    ∂J_M/∂x̃_n = −(2/N)(x_n − x̃_n)^T ∈ R^{1×D},                              (24)
    ∂x̃_n/∂z_{in} = ∂/∂z_{in} ( ∑_{m=1}^{M} z_{mn} b_m ) = b_i.

Hence

    ∂J_M/∂z_{in} = −(2/N)(x_n − x̃_n)^T b_i
                 = −(2/N)( x_n − ∑_{m=1}^{M} z_{mn} b_m )^T b_i              (25)
                 = −(2/N)( x_n^T b_i − z_{in} b_i^T b_i ) = −(2/N)( x_n^T b_i − z_{in} ).   (26)

Setting this to 0, we obtain

    z_{in} = x_n^T b_i = b_i^T x_n.                                           (27)
Finding the Basis of the Principal Subspace

To determine the basis vectors b_1, . . . , b_M of the principal subspace, we
rephrase the loss function (23) using the results we have so far. Using (27),

    x̃_n = ∑_{m=1}^{M} z_{mn} b_m = ∑_{m=1}^{M} (x_n^T b_m) b_m
        = ( ∑_{m=1}^{M} b_m b_m^T ) x_n.                                      (28)

Similarly, expanding x_n in the full basis (b_1, . . . , b_D) with coordinates
z_{dn} = x_n^T b_d gives

    x_n = ∑_{d=1}^{D} z_{dn} b_d = ∑_{d=1}^{D} (x_n^T b_d) b_d = ( ∑_{d=1}^{D} b_d b_d^T ) x_n
        = ( ∑_{m=1}^{M} b_m b_m^T ) x_n + ( ∑_{j=M+1}^{D} b_j b_j^T ) x_n.    (29)

Hence the difference vector between the original data point and its projection is

    x_n − x̃_n = ( ∑_{j=M+1}^{D} b_j b_j^T ) x_n = ∑_{j=M+1}^{D} (x_n^T b_j) b_j.   (30)



Reformulating the loss function, using (30):

    J_M = (1/N) ∑_{n=1}^{N} ∥x_n − x̃_n∥^2
        = (1/N) ∑_{n=1}^{N} ∥ ∑_{j=M+1}^{D} (b_j^T x_n) b_j ∥^2              (31)
        = (1/N) ∑_{n=1}^{N} ∑_{j=M+1}^{D} (b_j^T x_n)^2                      (32)
        = (1/N) ∑_{n=1}^{N} ∑_{j=M+1}^{D} b_j^T x_n x_n^T b_j
        = ∑_{j=M+1}^{D} b_j^T ( (1/N) ∑_{n=1}^{N} x_n x_n^T ) b_j = ∑_{j=M+1}^{D} b_j^T S b_j   (33)
        = ∑_{j=M+1}^{D} tr(b_j^T S b_j) = ∑_{j=M+1}^{D} tr(S b_j b_j^T)
        = tr( ( ∑_{j=M+1}^{D} b_j b_j^T ) S ).                               (34)



Hence the following are equivalent:

Minimize the average squared reconstruction error.
Minimize the variance of the data when projected onto the orthogonal
complement of the principal subspace.
Maximize the variance of the projection that we retain in the principal
subspace, which links the projection loss immediately to the maximum-variance
formulation of PCA.

The average squared reconstruction error when projecting onto the
M-dimensional principal subspace is

    J_M = ∑_{j=M+1}^{D} λ_j,                                                 (35)

where the λ_j are the eigenvalues of the data covariance matrix. Therefore, to
minimize (35) we need to select the smallest D − M eigenvalues, which then
implies that their corresponding eigenvectors are the basis of the orthogonal
complement of the principal subspace. Consequently, the basis of the principal
subspace comprises the eigenvectors b_1, . . . , b_M that are associated with
the M largest eigenvalues of the data covariance matrix.
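
The following numpy sketch (illustrative data; B and X_tilde are assumed names) verifies this numerically: projecting onto the top-M eigenvectors gives an average squared reconstruction error equal to the sum of the discarded eigenvalues, as in (35).

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, M = 6, 400, 2
X = rng.normal(size=(D, N)) * np.linspace(3.0, 0.2, D)[:, None]
X = X - X.mean(axis=1, keepdims=True)
S = (X @ X.T) / N

lam, U = np.linalg.eigh(S)               # ascending eigenvalues
B = U[:, -M:]                            # top-M eigenvectors: basis of the principal subspace
X_tilde = B @ (B.T @ X)                  # projections onto the principal subspace

J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=0))   # average squared reconstruction error, eq. (23)
print(np.isclose(J_M, lam[:-M].sum()))   # equals the sum of the D-M smallest eigenvalues, eq. (35)
```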
10.4 Eigenvector Computation and Low-Rank
Approximations
The SVD of the data matrix X = [x_1, ⋯ , x_N] ∈ R^{D×N} is given by

    X = U Σ V^T,                                                             (36)

where U ∈ R^{D×D} and V ∈ R^{N×N} are orthogonal matrices and Σ ∈ R^{D×N} is a
matrix whose only nonzero entries are the singular values σ_ii ≥ 0. The data
covariance matrix is

    S = (1/N) X X^T = (1/N) U Σ V^T V Σ^T U^T = (1/N) U Σ Σ^T U^T.            (37)

The columns of U are the eigenvectors of X X^T (see Section 4.5). Furthermore,
the eigenvalues λ_d of S are related to the singular values of X via

    λ_d = σ_d^2 / N.                                                          (38)

This relation provides the connection between the maximum variance view and
the singular value decomposition.
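
A brief numerical check of (38) (numpy sketch, illustrative data): the eigenvalues of S coincide with the squared singular values of the centered data matrix divided by N.

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 4, 250
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Σ V^T, eq. (36)
S = (X @ X.T) / N
lam = np.sort(np.linalg.eigvalsh(S))[::-1]             # eigenvalues of S, descending

print(np.allclose(lam, sigma**2 / N))                  # λ_d = σ_d^2 / N, eq. (38) -> True
```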
PCA Using Low-Rank Matrix Approximations

To maximize the variance of the projected data, PCA chooses the columns of U
in (37) that are associated with the M largest eigenvalues of the data
covariance matrix S, and we identify this D × M submatrix of U with the
projection matrix B in (3), which projects the original data onto a
lower-dimensional subspace of dimension M. The Eckart-Young theorem
(Section 4.6) offers a direct way to estimate the low-dimensional
representation. Consider the best rank-M approximation of X,

    X̃_M := argmin_{rk(A)≤M} ∥X − A∥_2 ∈ R^{D×N},                             (39)

where ∥⋅∥_2 is the spectral norm defined in Section 4.6.
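
A sketch of this connection in numpy (illustrative names): the best rank-M approximation obtained by truncating the SVD coincides with the PCA reconstruction B B^T X when B consists of the first M columns of U.

```python
import numpy as np

rng = np.random.default_rng(6)
D, N, M = 5, 100, 2
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-M approximation (Eckart-Young): keep only the M largest singular values.
X_M = U[:, :M] @ np.diag(sigma[:M]) @ Vt[:M, :]

# PCA reconstruction with B = first M columns of U gives the same matrix.
B = U[:, :M]
print(np.allclose(X_M, B @ (B.T @ X)))   # -> True
```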



Practical Aspects
For large matrices, it is not possible to compute the full set of eigenvalues
and eigenvectors. In practice, we solve for eigenvalues or singular values
using iterative methods, which are implemented in all modern packages for
linear algebra.

In PCA, we only require a few eigenvectors. If we are interested in only the
first few eigenvectors (with the largest eigenvalues), then iterative methods,
which directly optimize these eigenvectors, are computationally more efficient
than a full eigendecomposition (or SVD). In the extreme case of only needing
the first eigenvector, a simple method called power iteration can be used:
choose a random vector x_0 that is not in the null space of S and follow the
iteration

    x_{k+1} = S x_k / ∥S x_k∥.                                               (40)

This sequence of vectors converges to the eigenvector associated with the
largest eigenvalue of S.
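
A minimal power-iteration sketch in numpy (the function name power_iteration and the fixed iteration count are assumptions; in practice one would iterate until convergence):

```python
import numpy as np

def power_iteration(S, num_iters=1000, seed=0):
    """Power iteration, eq. (40): repeatedly apply S and normalize.
    Converges to the eigenvector of the largest eigenvalue (assuming it is
    unique and the start vector has a component in its direction)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=S.shape[0])          # random start, almost surely not in the null space
    for _ in range(num_iters):
        x = S @ x
        x = x / np.linalg.norm(x)
    return x

rng = np.random.default_rng(7)
A = rng.normal(size=(5, 200))
S = (A @ A.T) / A.shape[1]

b1 = power_iteration(S)
lam, U = np.linalg.eigh(S)
print(np.isclose(abs(b1 @ U[:, -1]), 1.0))   # matches the top eigenvector up to sign -> True
```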
10.5 PCA in High Dimensions

We now provide a method for performing PCA when we have fewer data points than
dimensions, N ≪ D.

Assume we have a centered dataset x_1, . . . , x_N ∈ R^D with no duplicate data
points. Since the data set is centered, rk(S) ≤ N − 1, so S has at least
D − N + 1 eigenvalues that are 0. We will exploit this and turn the D × D
covariance matrix into an N × N matrix whose nonzero eigenvalues are the same
as those of S.

Let b_m, m = 1, . . . , M, be a basis vector of the principal subspace. Then

    S b_m = (1/N) X X^T b_m = λ_m b_m                                        (41)
    ⇒ (1/N) X^T X X^T b_m = λ_m X^T b_m ⇔ (1/N) X^T X c_m = λ_m c_m.          (42)



Here c_m := X^T b_m is an eigenvector of the N × N matrix (1/N) X^T X ∈ R^{N×N}
associated with λ_m (note that (1/N) X^T X and S have the same nonzero
eigenvalues). We recover the original eigenvectors by multiplying by X:

    (1/N) X X^T X c_m = λ_m X c_m,                                           (43)

which means that X c_m is an eigenvector of S (it still needs to be normalized
to unit length).
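
A numpy sketch of this high-dimensional trick (illustrative sizes and names): eigendecompose the small N × N matrix of (42), map the eigenvectors back with X as in (43), and normalize; the result satisfies S b_m = λ_m b_m.

```python
import numpy as np

rng = np.random.default_rng(8)
D, N, M = 1000, 20, 3                    # many dimensions, few data points
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)    # centered data

# Eigendecomposition of the small N x N matrix (1/N) X^T X, eq. (42).
lam, C = np.linalg.eigh((X.T @ X) / N)
lam, C = lam[::-1], C[:, ::-1]           # descending order

# Recover eigenvectors of S as X c_m, eq. (43), and normalize them.
B = X @ C[:, :M]
B = B / np.linalg.norm(B, axis=0)

S = (X @ X.T) / N                        # D x D matrix, built here only for the check
print(np.allclose(S @ B, B * lam[:M]))   # S b_m = λ_m b_m -> True
```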



10.6 Key Steps of PCA in Practice

1. Mean subtraction
2. Standardization
3. Eigendecomposition of the covariance matrix
4. Projection
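
A compact end-to-end sketch of these four steps in numpy (the function pca and its variable names are illustrative assumptions); step 4 computes the codes z_n = B^T x_n and also maps the reconstruction back to the original units.

```python
import numpy as np

def pca(X, M):
    """Key steps of PCA on a data matrix X of shape (D, N), one column per data point."""
    # 1. Mean subtraction: center the data.
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # 2. Standardization: divide each dimension by its standard deviation.
    std = Xc.std(axis=1, keepdims=True)
    Xs = Xc / std
    # 3. Eigendecomposition of the covariance matrix.
    S = (Xs @ Xs.T) / X.shape[1]
    lam, U = np.linalg.eigh(S)
    B = U[:, ::-1][:, :M]                # eigenvectors of the M largest eigenvalues
    # 4. Projection: codes z_n = B^T x_n, reconstruction mapped back to original units.
    Z = B.T @ Xs
    X_tilde = (B @ Z) * std + mu
    return Z, X_tilde, B

rng = np.random.default_rng(9)
X = rng.normal(size=(5, 50)) * np.array([[5.0], [3.0], [1.0], [0.5], [0.1]])
Z, X_tilde, B = pca(X, M=2)
print(Z.shape, X_tilde.shape)            # (2, 50), (5, 50)
```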



Summary

We have studied:

Maximum variance perspective of PCA;


PCA as an algorithm that directly minimizes the average
reconstruction error;
PCA using low-rank matrix approximations;
PCA in high dimensions.

