# A Tutorial on Principal Component Analysis

Jonathon Shlens
Systems Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037 and Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093-0402
(Dated: December 10, 2005; Version 2)
Principal component analysis (PCA) is a mainstay of modern data analysis - a black box that is widely used but poorly understood. The goal of this paper is to dispel the magic behind this black box. This tutorial focuses on building a solid intuition for how and why principal component analysis works; furthermore, it crystallizes this knowledge by deriving from simple intuitions, the mathematics behind PCA. This tutorial does not shy away from explaining the ideas informally, nor does it shy away from the mathematics. The hope is that by addressing both aspects, readers of all levels will be able to gain a better understanding of PCA as well as the when, the how and the why of applying this technique.
I. INTRODUCTION
Principal component analysis (PCA) has been called one of the most valuable results from applied linear algebra. PCA is used abundantly in all forms of analysis - from neuroscience to computer graphics - because it is a simple, non-parametric method of extracting relevant information from confusing data sets. With minimal additional effort PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified structure that often underlies it.

The goal of this tutorial is to provide both an intuitive feel for PCA, and a thorough discussion of this topic. We will begin with a simple example and provide an intuitive explanation of the goal of PCA. We will continue by adding mathematical rigor to place it within the framework of linear algebra to provide an explicit solution. We will see how and why PCA is intimately related to the mathematical technique of singular value decomposition (SVD). This understanding will lead us to a prescription for how to apply PCA in the real world. We will discuss both the assumptions behind this technique as well as possible extensions to overcome these limitations.

The discussion and explanations in this paper are informal in the spirit of a tutorial. The goal of this paper is to educate. Occasionally, rigorous mathematical proofs are necessary although relegated to the Appendix. Although not as vital to the tutorial, the proofs are presented for the adventurous reader who desires a more complete understanding of the math. The only assumption is that the reader has a working knowledge of linear algebra. Please feel free to contact me with any suggestions, corrections or comments.
shlens@salk.edu
II. MOTIVATION: A TOY EXAMPLE
Here is the perspective: we are an experimenter. We are trying to understand some phenomenon by measuring various quantities (e.g. spectra, voltages, velocities, etc.) in our system. Unfortunately, we cannot figure out what is happening because the data appears clouded, unclear and even redundant. This is not a trivial problem, but rather a fundamental obstacle in empirical science. Examples abound from complex systems such as neuroscience, photometry, meteorology and oceanography - the number of variables to measure can be unwieldy and at times even deceptive, because the underlying relationships can often be quite simple.

Take for example a simple toy problem from physics diagrammed in Figure 1. Pretend we are studying the motion of the physicist's ideal spring. This system consists of a ball of mass $m$ attached to a massless, frictionless spring. The ball is released a small distance away from equilibrium (i.e. the spring is stretched). Because the spring is "ideal," it oscillates indefinitely along the $x$-axis about its equilibrium at a set frequency.

This is a standard problem in physics in which the motion along the $x$ direction is solved by an explicit function of time. In other words, the underlying dynamics can be expressed as a function of a single variable $x$.

However, being ignorant experimenters we do not know any of this. We do not know which, let alone how many, axes and dimensions are important to measure. Thus, we decide to measure the ball's position in a three-dimensional space (since we live in a three-dimensional world). Specifically, we place three movie cameras around our system of interest. At 120 Hz each movie camera records an image indicating a two-dimensional position of the ball (a projection). Unfortunately, because of our ignorance, we do not even know what the real "x", "y" and "z" axes are, so we choose three camera axes $\{\vec{a}, \vec{b}, \vec{c}\}$ at some arbitrary angles with respect to the system. The angles between our measurements might not even be 90°! Now, we record with the cameras for several minutes. The big question remains: how do we get from this data set to a simple equation of $x$?

FIG. 1 A diagram of the toy example.

We know a priori that if we were smart experimenters, we would have just measured the position along the $x$-axis with one camera. But this is not what happens in the real world. We often do not know which measurements best reflect the dynamics of our system in question. Furthermore, we sometimes record more dimensions than we actually need!

Also, we have to deal with that pesky, real-world problem of noise. In the toy example this means that we need to deal with air, imperfect cameras or even friction in a less-than-ideal spring. Noise contaminates our data set, only serving to obfuscate the dynamics further. This toy example is the challenge experimenters face every day. We will refer to this example as we delve further into abstract concepts. Hopefully, by the end of this paper we will have a good understanding of how to systematically extract $x$ using principal component analysis.
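To make the toy example concrete, the following is a minimal simulation sketch in Python with NumPy. Everything specific in it is an assumption made for illustration and is not taken from the tutorial: the function name `simulate_spring_cameras`, the 0.5 Hz oscillation frequency, the random camera orientations, and the Gaussian noise model. It only shows how one-dimensional dynamics can end up spread across six noisy measurement channels.

```python
import numpy as np

def simulate_spring_cameras(duration_s=600.0, rate_hz=120.0, noise_std=0.05, seed=0):
    """Return a 6 x n data matrix: three cameras, each recording a 2-D projection.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n = int(duration_s * rate_hz)
    t = np.arange(n) / rate_hz

    # True dynamics: the ball oscillates along the x-axis only (assumed 0.5 Hz).
    x = np.cos(2.0 * np.pi * 0.5 * t)
    ball = np.stack([x, np.zeros(n), np.zeros(n)])  # 3 x n positions in space

    rows = []
    for cam in range(3):
        # A random orthogonal 3x3 matrix; its first two rows act as the
        # (arbitrarily oriented) image-plane axes of this camera.
        q, _r = np.linalg.qr(rng.normal(size=(3, 3)))
        rows.append(q[:2] @ ball)  # 2 x n projection

    X = np.vstack(rows)  # 6 x n
    X = X + noise_std * rng.normal(size=X.shape)  # additive measurement noise
    return X

X = simulate_spring_cameras()
print(X.shape)  # (6, 72000)
```

With `noise_std=0.0` the six rows are all scaled copies of the same signal, which is the redundancy the experimenter does not yet know about.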
III. FRAMEWORK: CHANGE OF BASIS
The goal of principal component analysis is to compute the most meaningful basis to re-express a noisy data set. The hope is that this new basis will filter out the noise and reveal hidden structure. In the example of the spring, the explicit goal of PCA is to determine: "the dynamics are along the $x$-axis." In other words, the goal of PCA is to determine that $\hat{x}$ - the unit basis vector along the $x$-axis - is the important dimension. Determining this fact allows an experimenter to discern which dynamics are important, which are just redundant and which are just noise.
A. A Naive Basis
With a more precise definition of our goal, we need a more precise definition of our data as well. We treat every time sample (or experimental trial) as an individual sample in our data set. At each time sample we record a set of data consisting of multiple measurements (e.g. voltage, position, etc.). In our data set, at one point in time, camera $A$ records a corresponding ball position $(x_A, y_A)$. One sample or trial can then be expressed as a 6-dimensional column vector

$$\vec{X} = \begin{bmatrix} x_A \\ y_A \\ x_B \\ y_B \\ x_C \\ y_C \end{bmatrix}$$

where each camera contributes a 2-dimensional projection of the ball's position to the entire vector $\vec{X}$. If we record the ball's position for 10 minutes at 120 Hz, then we have recorded $10 \times 60 \times 120 = 72000$ of these vectors.

With this concrete example, let us recast this problem in abstract terms. Each sample $\vec{X}$ is an $m$-dimensional vector, where $m$ is the number of measurement types. Equivalently, every sample is a vector that lies in an $m$-dimensional vector space spanned by some orthonormal basis. From linear algebra we know that every measurement vector is a linear combination of this set of unit-length basis vectors. What is this orthonormal basis?

This question is usually a tacit assumption often overlooked. Pretend we gathered our toy example data above, but only looked at camera $A$. What is an orthonormal basis for $(x_A, y_A)$? A naive choice would be $\{(1, 0), (0, 1)\}$, but why select this basis over $\{(\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}), (-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2})\}$ or any other arbitrary rotation? The reason is that the naive basis reflects the method by which we gathered the data.
Pretend we record the position $(2, 2)$. We did not record $2\sqrt{2}$ in the $(\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2})$ direction and 0 in the perpendicular direction. Rather, we recorded the position $(2, 2)$ on our camera, meaning 2 units up and 2 units to the left in our camera window. Thus our naive basis reflects the method by which we measured our data.

How do we express this naive basis in linear algebra? In the two-dimensional case, $\{(1, 0), (0, 1)\}$ can be recast as individual row vectors. A matrix constructed out of these row vectors is the $2 \times 2$ identity matrix $\mathbf{I}$. We can generalize this to the $m$-dimensional case by constructing an $m \times m$ identity matrix

$$\mathbf{B} = \begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \\ \vdots \\ \mathbf{b}_m \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \mathbf{I}$$

where each row is an orthonormal basis vector $\mathbf{b}_i$ with $m$ components. We can consider our naive basis as the effective starting point. All of our data has been recorded in this basis and thus it can be trivially expressed as a linear combination of $\{\mathbf{b}_i\}$.
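The difference between the naive basis and a rotated one can be checked numerically. The sketch below is only an illustration, assuming NumPy; the variable names are hypothetical. It expresses the recorded position (2, 2) first in the naive (identity) basis and then in a basis rotated by 45 degrees, where the basis vectors are stored as rows.

```python
import numpy as np

# Naive basis: the rows of the 2x2 identity matrix.
B = np.eye(2)

# A rotated orthonormal basis (45 degrees); each row is one basis vector.
s = np.sqrt(2.0) / 2.0
R = np.array([[ s, s],
              [-s, s]])

point = np.array([2.0, 2.0])  # the position as recorded by the camera

coords_naive = B @ point      # coordinates in the naive basis: unchanged
coords_rotated = R @ point    # coordinates in the rotated basis

print(coords_naive)    # the recorded values, (2, 2)
print(coords_rotated)  # 2*sqrt(2) along the first rotated axis, 0 along the second
```

Both coordinate vectors describe the same point; only the naive one matches the numbers the camera actually reported.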
B. Change of Basis
With this rigor we may now state more precisely what PCA asks: Is there another basis, which is a linear combination of the original basis, that best re-expresses our data set?
A close reader might have noticed the conspicuous addition of the word linear. Indeed, PCA makes one stringent but powerful assumption: linearity. Linearity vastly simplifies the problem by (1) restricting the set of potential bases, and (2) formalizing the implicit assumption of continuity in a data set. [1]

With this assumption PCA is now limited to re-expressing the data as a linear combination of its basis vectors. Let $\mathbf{X}$ be the original data set, where each column is a single sample (or moment in time) of our data set (i.e. $\vec{X}$). In the toy example $\mathbf{X}$ is an $m \times n$ matrix where $m = 6$ and $n = 72000$. Let $\mathbf{Y}$ be another $m \times n$ matrix related by a linear transformation $\mathbf{P}$. $\mathbf{X}$ is the original recorded data set and $\mathbf{Y}$ is a re-representation of that data set.

$$\mathbf{P}\mathbf{X} = \mathbf{Y} \tag{1}$$

Also let us define the following quantities. [2]

- $\mathbf{p}_i$ are the rows of $\mathbf{P}$.
- $\mathbf{x}_i$ are the columns of $\mathbf{X}$ (or individual $\vec{X}$).
- $\mathbf{y}_i$ are the columns of $\mathbf{Y}$.

Equation 1 represents a change of basis and thus can have many interpretations.

1. $\mathbf{P}$ is a matrix that transforms $\mathbf{X}$ into $\mathbf{Y}$.
2. Geometrically, $\mathbf{P}$ is a rotation and a stretch which again transforms $\mathbf{X}$ into $\mathbf{Y}$.
3. The rows of $\mathbf{P}$, $\{\mathbf{p}_1, \ldots, \mathbf{p}_m\}$, are a set of new basis vectors for expressing the columns of $\mathbf{X}$.

The latter interpretation is not obvious but can be seen by writing out the explicit dot products of $\mathbf{P}\mathbf{X}$.

$$\mathbf{P}\mathbf{X} = \begin{bmatrix} \mathbf{p}_1 \\ \vdots \\ \mathbf{p}_m \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_n \end{bmatrix}$$

$$\mathbf{Y} = \begin{bmatrix} \mathbf{p}_1 \cdot \mathbf{x}_1 & \cdots & \mathbf{p}_1 \cdot \mathbf{x}_n \\ \vdots & \ddots & \vdots \\ \mathbf{p}_m \cdot \mathbf{x}_1 & \cdots & \mathbf{p}_m \cdot \mathbf{x}_n \end{bmatrix}$$
[1] A subtle point it is, but we have already assumed linearity by implicitly stating that the data set even characterizes the dynamics of the system. In other words, we are already relying on the superposition principle of linearity to believe that the data provides an ability to interpolate between individual data points.

[2] In this section $\mathbf{x}_i$ and $\mathbf{y}_i$ are column vectors, but be forewarned. In all other sections $\mathbf{x}_i$ and $\mathbf{y}_i$ are row vectors.

We can note the form of each column of $\mathbf{Y}$.

$$\mathbf{y}_i = \begin{bmatrix} \mathbf{p}_1 \cdot \mathbf{x}_i \\ \vdots \\ \mathbf{p}_m \cdot \mathbf{x}_i \end{bmatrix}$$

We recognize that each coefficient of $\mathbf{y}_i$ is a dot product of $\mathbf{x}_i$ with the corresponding row in $\mathbf{P}$. In other words, the $j$th coefficient of $\mathbf{y}_i$ is a projection onto the $j$th row of $\mathbf{P}$. This is in fact the very form of an equation where $\mathbf{y}_i$ is a projection onto the basis of $\{\mathbf{p}_1, \ldots, \mathbf{p}_m\}$. Therefore, the rows of $\mathbf{P}$ are indeed a new set of basis vectors for representing the columns of $\mathbf{X}$.
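The claim that each entry of the transformed data is a dot product between a row of P and a column of X can be verified directly. The following sketch assumes NumPy and uses small random matrices in place of real data; the helper name `change_basis` and the choice of an orthonormal P (built with a QR factorization) are illustrative assumptions, since the text allows P to be any linear transformation.

```python
import numpy as np

def change_basis(P, X):
    """Re-express the columns of X in the basis given by the rows of P."""
    return P @ X

rng = np.random.default_rng(0)
m, n = 6, 5                       # small sizes, for illustration only
X = rng.normal(size=(m, n))       # each column is one sample x_i

# One possible P: an orthonormal basis obtained from a QR factorization.
P, _r = np.linalg.qr(rng.normal(size=(m, m)))

Y = change_basis(P, X)

# Entry (j, i) of Y written out as the dot product p_j . x_i.
Y_explicit = np.array([[P[j] @ X[:, i] for i in range(n)] for j in range(m)])

print(np.allclose(Y, Y_explicit))  # the two constructions agree
```

Because this particular P is orthonormal, the original data can be recovered with `P.T @ Y`; that property is specific to the example and does not hold for an arbitrary P.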
C. Questions Remaining
By assuming linearity the problem reduces to finding the appropriate change of basis. The row vectors $\{\mathbf{p}_1, \ldots, \mathbf{p}_m\}$ in this transformation will become the principal components of $\mathbf{X}$. Several questions now arise.

- What is the best way to "re-express" $\mathbf{X}$?
- What is a good choice of basis $\mathbf{P}$?

These questions must be answered by next asking ourselves what features we would like $\mathbf{Y}$ to exhibit. Evidently, additional assumptions beyond linearity are required to arrive at a reasonable result. The selection of these assumptions is the subject of the next section.
IV. VARIANCE AND THE GOAL
Now comes the most important question: what does "best express" the data mean? This section will build up an intuitive answer to this question and along the way tack on additional assumptions. The end of this section will conclude with a mathematical goal for deciphering "garbled" data.

In a linear system "garbled" can refer to only three potential confounds: noise, rotation and redundancy. Let us deal with each situation individually.
A. Noise and Rotation
Measurement noise in any data set must be low or else, no matter the analysis technique, no information about a system can be extracted. There exists no absolute scale for noise but rather all noise is measured relative to the measurement. A common measure is the signal-to-noise ratio (SNR), or a ratio of variances $\sigma^2$,

$$SNR = \frac{\sigma^2_{signal}}{\sigma^2_{noise}}. \tag{2}$$

A high SNR ($\gg 1$) indicates high precision data, while a low SNR indicates noise contaminated data.
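As a rough numerical illustration of Equation 2, the sketch below computes the ratio of variances for a synthetic signal and synthetic noise. It assumes NumPy, and it assumes the idealized situation where the signal and the noise are available separately, which is rarely true for measured data; the function name `snr` and all numeric values are hypothetical.

```python
import numpy as np

def snr(signal, noise):
    """Signal-to-noise ratio as a ratio of variances (Equation 2)."""
    return np.var(signal) / np.var(noise)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 1200, endpoint=False)   # 10 s sampled at 120 Hz

signal = np.cos(2.0 * np.pi * 0.5 * t)             # assumed clean oscillation
quiet_noise = 0.1 * rng.normal(size=t.shape)       # small measurement noise
loud_noise = 2.0 * rng.normal(size=t.shape)        # large measurement noise

print(snr(signal, quiet_noise))  # much greater than 1: high precision data
print(snr(signal, loud_noise))   # below 1: noise contaminated data
```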
