
# Principal Component Analysis

Frank Wood

December 8, 2009

This lecture borrows and quotes from Jolliffe's *Principal Component Analysis* book. Go buy it!

## Principal Component Analysis

The central idea of principal component analysis (PCA) is
to reduce the dimensionality of a data set consisting of a
large number of interrelated variables, while retaining as
much as possible of the variation present in the data set.
This is achieved by transforming to a new set of variables,
the principal components (PCs), which are uncorrelated,
and which are ordered so that the first few retain most of
the variation present in all of the original variables.
[Jolliffe, Principal Component Analysis, 2nd edition]

Figure: PCA rotation of a projected Gaussian PDF

## PCA in a nutshell

Notation

- $x$ is a vector of $p$ random variables
- $\alpha_k$ is a vector of $p$ constants
- $\alpha_k' x = \sum_{j=1}^{p} \alpha_{kj} x_j$

Procedural description

- First find a linear function of $x$, $\alpha_1' x$, with maximum variance.
- Next find another linear function of $x$, $\alpha_2' x$, uncorrelated with $\alpha_1' x$ and having maximum variance.
- Iterate.

Goal

It is hoped, in general, that most of the variation in $x$ will be accounted for by $m$ PCs, where $m \ll p$.

## Derivation of PCA: Assumption and More Notation

- Assume that $\Sigma$, the covariance matrix of the vector $x$, is known.
- Foreshadowing: $\Sigma$ will be replaced with $S$, the sample covariance matrix, when $\Sigma$ is unknown.

Shortcut to solution

- For $k = 1, 2, \ldots, p$ the $k^{th}$ PC is given by $z_k = \alpha_k' x$, where $\alpha_k$ is an eigenvector of $\Sigma$ corresponding to its $k^{th}$ largest eigenvalue $\lambda_k$.
- If $\alpha_k$ is chosen to have unit length (i.e. $\alpha_k' \alpha_k = 1$) then $\mathrm{Var}(z_k) = \lambda_k$.
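This shortcut can be checked numerically. A minimal sketch (NumPy assumed, synthetic data drawn from the Gaussian used in the figures): the eigenvectors of $\Sigma$ yield scores that are uncorrelated and whose sample variances match the eigenvalues.

```python
import numpy as np

sigma = np.array([[4.5, 1.5],
                  [1.5, 1.0]])                 # known covariance matrix (from the example figures)

eigvals, eigvecs = np.linalg.eigh(sigma)       # ascending eigenvalues
order = np.argsort(eigvals)[::-1]              # reorder so lambda_1 is largest
eigvals, alphas = eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
x = rng.multivariate_normal([2, 5], sigma, size=200_000)

z = (x - x.mean(axis=0)) @ alphas              # scores z_k = alpha_k' x (centered)
sample_vars = z.var(axis=0, ddof=1)            # should approximate lambda_k
```

With enough samples, `sample_vars` sits within a fraction of a percent of `eigvals`, and the off-diagonal covariance of the scores is near zero.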

## Derivation of PCA: First Step

- To find the first PC we seek the vector $\alpha_k$ that maximizes $\mathrm{Var}(\alpha_k' x) = \alpha_k' \Sigma \alpha_k$. Without a constraint this is unbounded, so we impose the normalization $\alpha_k' \alpha_k = 1$ (a unit-length vector).
- To maximize $\alpha_k' \Sigma \alpha_k$ subject to $\alpha_k' \alpha_k = 1$ we use the technique of Lagrange multipliers. We maximize the function

$$\alpha_k' \Sigma \alpha_k - \lambda_k (\alpha_k' \alpha_k - 1)$$

w.r.t. $\alpha_k$ by differentiating w.r.t. $\alpha_k$.

## Derivation of PCA: Constrained maximization - method of Lagrange multipliers

This results in

$$\frac{d}{d\alpha_k} \left( \alpha_k' \Sigma \alpha_k - \lambda_k (\alpha_k' \alpha_k - 1) \right) = 0$$

$$\Sigma \alpha_k - \lambda_k \alpha_k = 0$$

$$\Sigma \alpha_k = \lambda_k \alpha_k$$

This should be recognizable as an eigenvector equation, where $\alpha_k$ is an eigenvector of $\Sigma$ and $\lambda_k$ is the associated eigenvalue.

Which eigenvector should we choose?

## Derivation of PCA: Constrained maximization - method of Lagrange multipliers

If we recognize that the quantity to be maximized is

$$\alpha_k' \Sigma \alpha_k = \alpha_k' \lambda_k \alpha_k = \lambda_k \alpha_k' \alpha_k = \lambda_k$$

then we should choose $\lambda_k$ to be as big as possible. So, calling $\lambda_1$ the largest eigenvalue of $\Sigma$ and $\alpha_1$ the corresponding eigenvector, the solution to

$$\Sigma \alpha_1 = \lambda_1 \alpha_1$$

is the 1st principal component of $x$.

We will demonstrate this for $k = 2$; the case $k > 2$ is more involved but similar.
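That the constrained maximum is attained at the leading eigenvector can also be seen with power iteration, which repeatedly applies $\Sigma$ and renormalizes to maintain the unit-length constraint. A sketch in NumPy (the $2 \times 2$ covariance matrix is the one from the example figures):

```python
import numpy as np

sigma = np.array([[4.5, 1.5],
                  [1.5, 1.0]])        # example covariance matrix

# Power iteration: alpha converges to the eigenvector of the largest eigenvalue.
alpha = np.array([1.0, 0.0])
for _ in range(200):
    alpha = sigma @ alpha
    alpha /= np.linalg.norm(alpha)    # enforce the constraint alpha' alpha = 1

lam = alpha @ sigma @ alpha           # the maximized variance alpha' Sigma alpha

eigvals, eigvecs = np.linalg.eigh(sigma)   # reference eigendecomposition
```

After convergence, `lam` equals the largest eigenvalue of `sigma` and `alpha` matches the corresponding eigenvector up to sign.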

## Derivation of PCA: Constrained maximization - more constraints

- The second PC, $\alpha_2' x$, maximizes $\alpha_2' \Sigma \alpha_2$ subject to being uncorrelated with $\alpha_1' x$.
- The uncorrelatedness constraint can be expressed using any of these equations:

$$\mathrm{cov}(\alpha_1' x, \alpha_2' x) = \alpha_1' \Sigma \alpha_2 = \alpha_2' \Sigma \alpha_1 = \alpha_2' \lambda_1 \alpha_1 = \lambda_1 \alpha_2' \alpha_1 = \lambda_1 \alpha_1' \alpha_2 = 0$$

- Of these, if we choose the last we can write a Lagrangian to maximize over $\alpha_2$:

$$\alpha_2' \Sigma \alpha_2 - \lambda_2 (\alpha_2' \alpha_2 - 1) - \phi\, \alpha_2' \alpha_1$$

## Derivation of PCA: Constrained maximization - more constraints

Differentiation of this quantity w.r.t. $\alpha_2$ (and setting the result equal to zero) yields

$$\frac{d}{d\alpha_2} \left( \alpha_2' \Sigma \alpha_2 - \lambda_2 (\alpha_2' \alpha_2 - 1) - \phi\, \alpha_2' \alpha_1 \right) = 0$$

$$\Sigma \alpha_2 - \lambda_2 \alpha_2 - \phi \alpha_1 = 0$$

If we left multiply $\alpha_1'$ into this expression

$$\alpha_1' \Sigma \alpha_2 - \lambda_2 \alpha_1' \alpha_2 - \phi \alpha_1' \alpha_1 = 0$$

$$0 - 0 - \phi \cdot 1 = 0$$

then we can see that $\phi$ must be zero, and when this is true we are left with

$$\Sigma \alpha_2 - \lambda_2 \alpha_2 = 0$$

## Derivation of PCA

Clearly

$$\Sigma \alpha_2 - \lambda_2 \alpha_2 = 0$$

is another eigenvalue equation, and the same strategy of choosing $\alpha_2$ to be the eigenvector associated with the second largest eigenvalue yields the second PC of $x$, namely $\alpha_2' x$.

This process can be repeated for $k = 1, \ldots, p$, yielding up to $p$ different eigenvectors of $\Sigma$ along with the corresponding eigenvalues $\lambda_1, \ldots, \lambda_p$.

Furthermore, the variance of each of the PCs is given by

$$\mathrm{Var}[\alpha_k' x] = \lambda_k, \qquad k = 1, 2, \ldots, p$$

## Properties of PCA

For any integer $q$, $1 \le q \le p$, consider the orthonormal linear transformation

$$y = B' x$$

where $y$ is a $q$-element vector and $B'$ is a $(q \times p)$ matrix, and let $\Sigma_y = B' \Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, denoted $\mathrm{tr}(\Sigma_y)$, is maximized by taking $B = A_q$, where $A_q$ consists of the first $q$ columns of $A$ (the matrix whose $k^{th}$ column is the $k^{th}$ eigenvector $\alpha_k$).

What this means is that if you want to choose a lower dimensional projection of $x$, the choice of $B$ described here is probably a good one. It maximizes the (retained) variance of the resulting variables. In fact, since the projections are uncorrelated, the percentage of variance accounted for by retaining the first $q$ PCs is given by

$$100 \times \frac{\sum_{k=1}^{q} \lambda_k}{\sum_{k=1}^{p} \lambda_k}$$
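A small helper illustrating this percentage (the eigenvalues are hypothetical; NumPy assumed):

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, largest first.
eigvals = np.array([5.0, 2.5, 1.5, 0.7, 0.3])

def pct_variance_retained(eigvals, q):
    """Percentage of total variance accounted for by the first q PCs."""
    return 100.0 * eigvals[:q].sum() / eigvals.sum()
```

For the example eigenvalues above, the first two PCs retain 75% of the variance, and retaining all $p$ of them retains 100%.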

## PCA using the sample covariance matrix

If we recall that the sample covariance matrix (an unbiased estimator for the covariance matrix of $x$) is given by

$$S = \frac{1}{n-1} X' X$$

where $X$ is an $(n \times p)$ matrix with $(i,j)^{th}$ element $(x_{ij} - \bar{x}_j)$ (in other words, $X$ is a zero-mean design matrix), then we construct the matrix $A$ by combining the $p$ eigenvectors of $S$ (or, equivalently, the eigenvectors of $X'X$; they are the same) and define a matrix of PC scores

$$Z = XA$$

Of course, if we instead form $Z$ by selecting the $q$ eigenvectors corresponding to the $q$ largest eigenvalues of $S$ when forming $A$, then we can achieve an optimal (in some senses) $q$-dimensional projection of $x$.
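These steps can be sketched in NumPy (the design matrix is synthetic and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 0.7]])

X = raw - raw.mean(axis=0)          # zero-mean design matrix
n, p = X.shape
S = X.T @ X / (n - 1)               # sample covariance matrix

eigvals, A = np.linalg.eigh(S)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]   # largest eigenvalue first

Z = X @ A                           # matrix of PC scores
score_cov = Z.T @ Z / (n - 1)       # diagonal, with Var(z_k) = k-th eigenvalue
```

The sample covariance of the scores comes out diagonal with the eigenvalues of $S$ on the diagonal, matching the derivation above.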

## PCA using the singular value decomposition

Given the sample covariance matrix

$$S = \frac{1}{n-1} X' X$$

an alternative way to arrive at the eigenvector and eigenvalue matrices is to utilize the singular value decomposition of $S = A \Lambda A'$, where $A$ is a matrix consisting of the eigenvectors of $S$ and $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues corresponding to each eigenvector.

Creating a reduced dimensionality projection of $X$ is accomplished by selecting the $q$ largest eigenvalues in $\Lambda$ and retaining the $q$ corresponding eigenvectors from $A$.
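The same eigenvalues can also be recovered from the singular values of the centered design matrix itself, since the squared singular values of $X$ divided by $n-1$ are the eigenvalues of $S$. A sketch assuming NumPy and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
raw = rng.normal(size=(200, 4))
X = raw - raw.mean(axis=0)                  # zero-mean design matrix
n = X.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix S.
S = X.T @ X / (n - 1)
eig_route = np.sort(np.linalg.eigvalsh(S))[::-1]   # largest first

# Route 2: singular values of X; eigenvalues of S are s_k^2 / (n - 1).
s = np.linalg.svd(X, compute_uv=False)             # descending singular values
svd_route = s**2 / (n - 1)
```

Working with the SVD of $X$ directly avoids ever forming $S$, which is often numerically preferable.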

## PCA in linear regression

PCA is useful in linear regression in several ways:

- Reduction in the dimension of the input space, leading to fewer parameters and easier regression.
- Related to the last point, the variance of the regression coefficient estimator is minimized by the PCA choice of basis.




Figure: samples drawn from $x \sim \mathcal{N}\left( \begin{bmatrix} 2 \\ 5 \end{bmatrix}, \begin{bmatrix} 4.5 & 1.5 \\ 1.5 & 1.0 \end{bmatrix} \right)$ and their PCA rotation.

## Projection of collinear data

The figures before showed the data without the third collinear design matrix column. Plotting such data is not possible, but its collinearity is obvious by design.

When PCA is applied to a design matrix of rank $q$ less than $p$, the number of positive eigenvalues discovered is equal to $q$, the true rank of the design matrix.

If the number of PCs retained is at least $q$ (and the data is perfectly collinear, etc.), all of the variance of the data is retained in the low dimensional projection.

In this example, when PCA is run on the design matrix of rank 2, the resulting projection back into two dimensions has exactly the same distribution as before.
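A small numerical illustration (NumPy; the collinear design is hypothetical): the third column is the sum of the first two, so one eigenvalue is numerically zero and the first two PCs retain all of the variance.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + b])     # third column collinear by construction
X = X - X.mean(axis=0)

S = X.T @ X / (X.shape[0] - 1)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]   # largest first

rank = np.linalg.matrix_rank(X)        # the true rank is 2, not 3
pct_retained = 100 * eigvals[:2].sum() / eigvals.sum()
```

Only two of the three eigenvalues are positive, and retaining two PCs keeps essentially 100% of the variance.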

## Reduction in regression coefficient estimator variance

If we take the standard regression model

$$y = X\beta + \epsilon$$

and consider instead the PCA rotation of $X$ given by

$$Z = XA$$

then we can rewrite the regression model in terms of the PCs:

$$y = Z\gamma + \epsilon$$

We can also consider the reduced model

$$y = Z_q \gamma_q + \epsilon_q$$

where only the first $q$ PCs are retained.

## Reduction in regression coefficient estimator variance

If we rewrite the regression relation as

$$y = Z\gamma + \epsilon$$

then we can, because $A$ is orthogonal, rewrite

$$X\beta = XAA'\beta = Z\gamma$$

where $\gamma = A'\beta$ and $\beta = A\gamma$.

Clearly, using least squares (or ML) to learn $\hat{\gamma}$ is equivalent to learning $\hat{\beta}$ directly. And, as usual,

$$\hat{\gamma} = (Z'Z)^{-1} Z' y$$

so

$$\hat{\beta} = A (Z'Z)^{-1} Z' y$$
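This equivalence can be checked numerically; a sketch with synthetic data (NumPy assumed), comparing least squares on $X$ directly against regression on the full set of PC scores:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
beta_true = np.array([1.0, -2.0, 0.5])        # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=n)

_, A = np.linalg.eigh(X.T @ X)                # columns of A: eigenvectors of X'X
Z = X @ A                                     # PC scores

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]    # learn beta directly
gamma_hat = np.linalg.lstsq(Z, y, rcond=None)[0]   # learn gamma on the PCs
beta_from_pcs = A @ gamma_hat                      # rotate back: beta = A gamma
```

Because $A$ is orthogonal, `beta_from_pcs` agrees with `beta_ols` up to floating point error.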

## Reduction in regression coefficient estimator variance

Without derivation we note that the variance-covariance matrix of $\hat{\beta}$ is given by

$$\mathrm{Var}(\hat{\beta}) = \sigma^2 \sum_{k=1}^{p} l_k^{-1} a_k a_k'$$

where $l_k$ is the $k^{th}$ largest eigenvalue of $X'X$, $a_k$ is the $k^{th}$ column of $A$, and $\sigma^2$ is the observation noise variance, i.e. $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.

This sheds light on how multicollinearities produce large variances for the elements of $\hat{\beta}$: if an eigenvalue $l_k$ is small then the resulting variance of the estimator will be large.

## Reduction in regression coefficient estimator variance

One way to avoid this is to ignore those PCs that are associated with small eigenvalues, namely, to use the biased estimator

$$\tilde{\beta} = \sum_{k=1}^{m} l_k^{-1} a_k a_k' X' y$$

where $l_{1:m}$ are the large eigenvalues of $X'X$ and $l_{m+1:p}$ are the small ones. Its variance is

$$\mathrm{Var}(\tilde{\beta}) = \sigma^2 \sum_{k=1}^{m} l_k^{-1} a_k a_k'$$

This is a biased estimator, but since its variance is smaller, it is possible that this could be an advantage.

Homework: find the bias of this estimator. Hint: use the spectral decomposition of $X'X$.
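A sketch of this biased estimator on a nearly collinear design (NumPy; the data and the choice $m = 2$ are illustrative), showing the drop in the $\sum_k l_k^{-1}$ variance scale when the small-eigenvalue PC is discarded:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
# Third column is nearly a + b: the design is almost collinear.
X = np.column_stack([a, b, a + b + rng.normal(scale=1e-3, size=n)])
X = X - X.mean(axis=0)
y = X @ np.array([1.0, 1.0, 0.0]) + rng.normal(scale=0.1, size=n)

l, A = np.linalg.eigh(X.T @ X)
order = np.argsort(l)[::-1]
l, A = l[order], A[:, order]              # l_1 >= l_2 >= l_3

m = 2                                      # drop the PC with the tiny eigenvalue
beta_tilde = sum(l[k] ** -1 * np.outer(A[:, k], A[:, k]) for k in range(m)) @ X.T @ y

# Trace of the estimator covariance, up to the sigma^2 factor.
var_scale_ols = np.sum(1.0 / l)            # includes the huge 1/l_3 term
var_scale_tilde = np.sum(1.0 / l[:m])      # small eigenvalue dropped
```

Because $l_3$ is tiny, the $1/l_3$ term dominates the variance of the full estimator, and dropping it shrinks the variance scale by orders of magnitude, at the cost of bias.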

## Problems with PCA

PCA is not without its problems and limitations:

- PCA assumes approximate normality of the input space distribution.
- PCA may still be able to produce a good low dimensional projection of the data even if the data isn't normally distributed.
- PCA assumes that the input data is real and continuous.

Extensions to consider:

- Collins et al., A generalization of principal components analysis to the exponential family.
- Hyvärinen, A. and Oja, E., Independent component analysis: algorithms and applications.
- ISOMAP, LLE, Maximum Variance Unfolding, etc.

Figure: Non-normal data