
Principal Component Analysis

An Introduction
April 15th, 2015
Ari Paul


What is Principal Component Analysis (PCA)?
PCA is a tool for analyzing a complex data set and re-expressing it in simpler terms. The textbook definition
is: a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated variables called principal components. It
replaces a set of observations with a new dataset that may better explain the underlying dynamics of the
system with less data.
Before diving into the meat of PCA, I'll provide 4 slides on linear algebra, a necessary toolkit for
understanding and performing PCA.

Linear Algebra
Terminology and Basic Matrix Math
Determinants and Characteristic Equation
Eigenvectors and Eigenvalues
Step 1 Centering and Scaling
Step 2 Covariance Matrix
Step 3 Calculating Eigenvectors
Step 3 (cont) Eigenvector Explanation
Step 4 Re-express the Data
Assumptions and Pitfalls



In the plot above, the data is re-mapped from the x-axis and y-axis to Principal Components 1 and 2.
This re-mapping better captures the dynamic of high variance along the PC 1 axis, and low variance
along the PC 2 axis.

Linear Algebra (1 of 4) Terminology and Basic Matrix Math

A single real number is a scalar (e.g. 7). A scalar can be thought of as a matrix with one row and one
column. A vector is a matrix with one row and multiple columns, or one column and multiple rows. [4 0 2
1] is a vector with one row and four columns. This can be referred to as a 1x4 matrix, or m x n matrix more
broadly, where m is the # of rows, and n is the # of columns. We can refer to a specific number within a
matrix using an i subscript to refer to the row and j to refer to the column. In the vector example given
above, V1,3 refers to the 2.
Matrix and Vector Math
Adding and subtracting matrices is easy, but they must be of equal dimensions. With matrices of equal
dimensions, you simply add or subtract the numbers in the same position in each matrix.

Multiplying a matrix by a scalar is similarly easy.

Transposing a matrix is also very easy:
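The operations on this slide can be sketched in Matlab as follows. The vector v is the 1x4 example from the text; the matrices A and B are illustrative values chosen for this sketch, not taken from the original slides.

```matlab
% v is the 1x4 vector from the text; A and B are illustrative assumptions.
v = [4 0 2 1];      % one row, four columns
x = v(1, 3);        % element in row 1, column 3, which is 2

A = [1 2; 3 4];
B = [5 6; 7 8];

S = A + B;          % element-wise sum:        [6 8; 10 12]
D = A - B;          % element-wise difference: [-4 -4; -4 -4]
C = 3 * A;          % scalar multiple:         [3 6; 9 12]
T = A';             % transpose (rows become columns): [1 3; 2 4]
```

For real-valued matrices the ' operator is a plain transpose; A + B raises an error if the dimensions differ.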


Linear Algebra (2 of 4) Matrices

Multiplying a matrix by a matrix is a bit more complicated. The matrices can be different sizes, but the
number of columns of the first matrix must equal the number of rows of the second matrix. The resulting
matrix always has the number of rows of the first matrix and the number of columns of the second matrix.
Each entry of the result is the dot product (inner product) of a row of the first matrix with a column of
the second. Matrix multiplication is NOT commutative: in general, A x B ≠ B x A. Multiplying a vector by a
matrix is identical, since vectors are just matrices with either a single row or a single column. For
example, one entry of a product is computed as:

1x8 + 2x10 + 3x12 = 64
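A minimal Matlab sketch of matrix multiplication follows. The matrices A and B are assumptions, chosen so that one entry of the product reproduces the sum 1x8 + 2x10 + 3x12 = 64 shown above.

```matlab
% A and B are assumed example matrices (not from the original slides).
A = [1 2 3; 4 5 6];        % 2x3
B = [7 8; 9 10; 11 12];    % 3x2

P = A * B;                 % 2x2 result: [58 64; 139 154]
entry = A(1, :) * B(:, 2); % row 1 of A times column 2 of B: 1*8 + 2*10 + 3*12 = 64

Q = B * A;                 % 3x3 result, so A*B and B*A differ even in size
```

The inner dimensions (3 and 3) must agree; the outer dimensions (2 and 2) give the size of the result.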

The identity matrix, I, is a special matrix with 1s along the diagonal and zeroes everywhere else. It is
square, but can be of any dimension. It has the interesting property that any matrix A multiplied by it
will equal itself: A x I = A.

Calculating the inverse of a matrix is tricky, but all we need to know for basic PCA is that a matrix
multiplied by its inverse will always equal the identity matrix, I: $A A^{-1} = I$. For the inverse of
a matrix to exist, it must be square (m = n), and it must have a non-zero determinant.
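The identity and inverse properties can be checked numerically. This is a small sketch with assumed example matrices; inv() is used for illustration only.

```matlab
% A and S are assumed example matrices.
A = [4 7; 2 6];            % determinant is 4*6 - 7*2 = 10, so an inverse exists
I = eye(2);                % 2x2 identity matrix

sameAsA = isequal(A * I, A);     % true: multiplying by I leaves A unchanged

Ainv = inv(A);
err = norm(A * Ainv - I);        % approximately zero, up to rounding error

S = [1 2; 2 4];            % second row is twice the first
dS = det(S);               % zero determinant, so S has no inverse
```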


Linear Algebra (3 of 4) Determinants and Characteristic Equation

Calculating the determinant of a matrix is an important step in PCA. It's easy for 2x2 matrices,
but for anything larger than 3x3, it's prohibitively time consuming to do by hand.

For a 2x2 matrix A with rows (a, b) and (c, d):
|A| = ad - bc

For a 3x3 matrix A with rows (a, b, c), (d, e, f), and (g, h, i):
|A| = a(ei - fh) - b(di - fg) + c(dh - eg)
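As a quick check, the two formulas above can be compared with Matlab's built-in det(). The matrices are assumed example values.

```matlab
% A2 and A3 are assumed example matrices.
A2 = [3 8; 4 6];
d2 = A2(1,1)*A2(2,2) - A2(1,2)*A2(2,1);   % ad - bc = 18 - 32 = -14

A3 = [6 1 1; 4 -2 5; 2 8 7];
% a(ei - fh) - b(di - fg) + c(dh - eg), written with row/column indices
d3 = A3(1,1)*(A3(2,2)*A3(3,3) - A3(2,3)*A3(3,2)) ...
   - A3(1,2)*(A3(2,1)*A3(3,3) - A3(2,3)*A3(3,1)) ...
   + A3(1,3)*(A3(2,1)*A3(3,2) - A3(2,2)*A3(3,1));   % -306

% det() agrees with both hand formulas up to rounding error
diff2 = abs(d2 - det(A2));
diff3 = abs(d3 - det(A3));
```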

The characteristic equation or characteristic polynomial of a matrix makes use of the
determinant to identify eigenvalues. The importance of this step will be clear once we dive into the
actual PCA, but first I want to introduce the math.
Let's start with matrix A.

Now form the matrix tI - A:

And now take the determinant of tI - A.

This gives us the characteristic polynomial of matrix A; setting it equal to zero gives the
characteristic equation.
The purpose of this exercise is to identify the polynomial whose zeroes are the eigenvalues of matrix A. We'll
dive into eigenvalues next.
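In Matlab, poly() applied to a square matrix returns the coefficients of det(tI - A), and roots() finds its zeroes. The matrix below is an assumed example, not necessarily the one used on the original slide.

```matlab
% A is an assumed example matrix.
A = [2 1; 1 2];

p = poly(A);     % coefficients of det(t*I - A): t^2 - 4t + 3, i.e. [1 -4 3]
r = roots(p);    % zeroes of the characteristic polynomial: 1 and 3

% By hand: det([t-2 -1; -1 t-2]) = (t-2)^2 - 1 = t^2 - 4t + 3
```

The zeroes returned by roots(p) are the eigenvalues of A.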


Linear Algebra (4 of 4) Eigenvectors and Eigenvalues

Geometrically, an eigenvector is a direction that a transformation leaves unchanged, and its eigenvalue
is the associated stretch along that direction. In the shear mapping of the Mona Lisa to the right, the
blue line is an eigenvector because its direction does not change. Its eigenvalue is 1 because its length
does not change. The red line is not an eigenvector, because its direction changes.

In linear algebra terminology, an eigenvector is a vector that points in a direction which is invariant under
the associated linear transformation:

$A v = \lambda v$

In this equation, $\lambda$ is a real number (scalar) known as the eigenvalue, v is the eigenvector, and A is
the original matrix. This equation says that a vector v is an eigenvector of matrix A if the product Av can
be restated as a scalar multiple of v. The intuition here is that the vector v is being scaled or stretched
by matrix A, but is not otherwise changing (i.e. its direction is unaffected). This will be clearer with an
example.
Start with matrix A:

The equation

$A v = \lambda v$

can be rewritten as:

$(A - \lambda I) v = 0$

Set the determinant equal to zero:

$\det(A - \lambda I) = 0$

This is the characteristic equation.

It has roots $\lambda = 1$ and $\lambda = 3$; these are A's eigenvalues. To find the eigenvectors, we
substitute each root into the equation and solve for the vector.
For $\lambda = 3$:

$(A - 3I) w = 0$

which has solution:

w is one of the two eigenvectors of matrix A.
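The procedure above can be sketched numerically. The matrix A below is an assumption, chosen because its eigenvalues are 1 and 3, matching the roots quoted above; the vector w is the corresponding solution for that assumed matrix.

```matlab
% A is an assumed matrix whose eigenvalues are 1 and 3.
A = [2 1; 1 2];

[V, D] = eig(A);         % columns of V are eigenvectors, diag(D) the eigenvalues
lambda = diag(D);        % 1 and 3

% For lambda = 3, (A - 3I) w = 0 is solved by any multiple of [1; 1].
w = [1; 1];
singularCheck = det(A - 3*eye(2));   % 0, so a non-zero solution w exists
stretched = A * w;                   % [3; 3], i.e. exactly 3 * w
```

eig() returns unit-length eigenvectors, so its column for the eigenvalue 3 is w divided by its length, possibly with the sign flipped.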

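By setting the determinant equal to zero and then finding the roots, we identify the eigenvalues. For a
symmetric matrix, such as a covariance matrix, the eigenvectors associated with distinct eigenvalues will
be orthogonal.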

PCA - Introduction
Having established a mathematical framework, we can now work through the PCA example step by step.

Our Example:
I find PCA easiest to understand via example, so I'll walk through a calculation using a very simple data set.
Imagine that we measure the height, neck circumference, and armspan of 4 individuals. Intuitively, these 3
variables are likely to be correlated (e.g. a taller person is more likely to have a longer armspan). Our goal
is to re-express the dataset using fewer variables to better and more simply capture the system's underlying
dynamics. All measurements are in inches.

Performing Principal Component Analysis requires the following 4 steps:

1. Center and scale the data
2. Calculate the covariance matrix
3. Calculate the eigenvalues and eigenvectors of the covariance matrix, and sort them by eigenvalue
4. Multiply the standardized data by the eigenvectors


PCA Steps 1 & 2

1. Center and Scale the Data
This simple step requires subtracting the mean of each column vector from each datapoint and then dividing
by the standard deviation. This results in column vectors with mean zero and standard deviation of 1, and
allows for an apples to apples comparison of variables.
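Centering and scaling can be sketched in base Matlab as follows. The measurements in X are hypothetical values invented for this sketch, not the dataset from the original slides.

```matlab
% Hypothetical measurements in inches: rows are people,
% columns are height, neck circumference, and armspan.
X = [60 14 58;
     66 15 67;
     70 16 68;
     74 17 76];

mu    = mean(X);             % 1x3 vector of column means
sigma = std(X);              % 1x3 vector of sample standard deviations (n-1 divisor)

% Implicit expansion subtracts/divides column by column (Matlab R2016b or later).
X2 = (X - mu) ./ sigma;

colMeans = mean(X2);         % approximately [0 0 0]
colStds  = std(X2);          % approximately [1 1 1]
```

With the Statistics Toolbox, zscore(X) performs the same centering and scaling in one call.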

2. Find the Covariance Matrix

Now we identify the covariance matrix of our centered and scaled dataset. The covariance matrix can be
calculated as $\frac{1}{n-1} X^T X$, where X is our centered/scaled data with one row per observation and n
is the number of observations. The diagonal entries are the variances of the individual variables and the
off-diagonal entries are the covariances between pairs of variables. Since we standardized the data, the
correlation matrix and covariance matrix are equal. From this, we can immediately notice that all of the
variables are highly correlated to one another; the weakest correlation is 0.7166.
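A sketch of the covariance calculation, assuming observations are stored as rows (as in Appendix 2), so that the product is formed as X2' * X2. The data is hypothetical.

```matlab
% Hypothetical measurements (rows = people; columns = height, neck, armspan).
X  = [60 14 58; 66 15 67; 70 16 68; 74 17 76];
X2 = (X - mean(X)) ./ std(X);     % centered and scaled (R2016b or later)

n = size(X2, 1);                  % number of observations
S = (X2' * X2) / (n - 1);         % 3x3 covariance matrix of the standardized data

% For standardized data the covariance matrix equals the correlation
% matrix of the original data, and its diagonal is all ones.
diffCov  = norm(S - cov(X2));
diffCorr = norm(S - corrcoef(X));
```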


PCA Step 3
3. Calculate Eigenvectors and Eigenvalues, and Sort by Associated Eigenvalue
Realistically you'll be using a software package like Matlab to calculate the eigenvectors and eigenvalues,
because it is prohibitively difficult to do by hand for larger matrices, but this example is simple enough for us to
work through in detail. This follows the process laid out on slides 5 and 6.
$A v = \lambda v$ is rewritten as $(A - \lambda I) v = 0$, and then we identify the characteristic equation by
setting the determinant equal to zero.

The determinant of a 3x3 matrix is $|B| = a(ei - fh) - b(di - fg) + c(dh - eg) = aei - afh - bdi + bfg + cdh - ceg$.
Applying this to $tI - A$, where A is our covariance matrix, gives:
$(t-1)^3 - 0.7166^2 (t-1) - 0.9676^2 (t-1) - 0.9676(0.7166)(0.8534) - 0.8534(0.9676)(0.7166) - 0.8534^2 (t-1)$
which further reduces to: $t^3 - 3t^2 + 0.8219t - 0.0054$
This polynomial has the roots 2.6959, 0.0067, and 0.2974. These are the eigenvalues of the covariance
matrix, and their relative sizes reflect their relative importance. Because the data is standardized, the
eigenvalues sum to the number of variables (in this case 3). The eigenvalues tell us that the first
eigenvector accounts for 89.9% of the variance of the entire dataset, the second just 0.2%, and the third
9.9%. We then solve for v as shown on slide 6 and get the eigenvectors (0.6052, 0.5770, 0.5484),
(0.7779, -0.5748, -0.2537), and (-0.1688, -0.5802, 0.7968) respectively. If you graphed these vectors on a
3-dimensional plot, they would all be perpendicular to one another.
We then sort the eigenvectors in order of their associated eigenvalue (highest eigenvalue first) and
arrange them as the columns of a matrix:

 0.6052  -0.1688   0.7779
 0.5770  -0.5802  -0.5748
 0.5484   0.7968  -0.2537
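Step 3 can be sketched in Matlab using the three correlations quoted in the characteristic equation above (0.9676, 0.8534, and 0.7166). Their placement within the matrix R is an assumption of this sketch; it affects the eigenvectors but not the eigenvalues.

```matlab
% Correlation values are from the text; their positions in R are assumed.
R = [1      0.9676 0.8534;
     0.9676 1      0.7166;
     0.8534 0.7166 1     ];

[V, D] = eig(R);                          % eigenvectors in columns of V
[lambda, order] = sort(diag(D), 'descend');
V = V(:, order);                          % reorder columns, largest eigenvalue first

explained = lambda / sum(lambda);         % share of total variance per component
```

Because the correlations are rounded to four decimals, the computed eigenvalues agree with 2.6959, 0.2974, and 0.0067 only to about that precision, and the signs of the columns of V may differ between software packages.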


PCA Step 3 (explained)

3. (Continued) - Purpose and Properties of Eigenvalues and Eigenvectors
To understand the purpose of using eigenvectors, let's take a step back to the covariance matrix of our
centered/scaled dataset. This covariance matrix reflects that each of our variables has a non-zero correlation
with one another. We want to re-express this data with a set of orthogonal (i.e. zero correlation) variables. This
will let us get a look into the distinct pure forces driving the system. We need to convert a set of correlated
variables into a set of uncorrelated variables; another way of saying the same is that we need to convert an
original covariance matrix with non-zero covariances into a new covariance matrix with only zero covariances
(aka diagonalize the matrix.) The eigenvectors of the covariance matrix are specifically chosen so that they
diagonalize it.
The eigenvectors are orthonormal: they are orthogonal to one another (i.e. perpendicular, so the
resulting components are uncorrelated) and each has unit length. Additionally, the sign of each eigenvector
is arbitrary, and different software platforms will output results with different signs. If we flip the
sign of every number in a particular eigenvector, its key characteristics are unchanged. For example, the
vector (1, -3) has the same properties as (-1, 3) in this context; both of these vectors are perpendicular
to the same set of other vectors, and they are of equal length.
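These properties can be checked numerically. The sketch below reuses the correlations quoted in step 3; their placement within R is an assumption.

```matlab
% Correlation values are from the text; their positions in R are assumed.
R = [1 0.9676 0.8534; 0.9676 1 0.7166; 0.8534 0.7166 1];
[V, D] = eig(R);

orthoErr = norm(V' * V - eye(3));   % ~0: columns are orthogonal and unit length
Dcheck   = V' * R * V;              % diagonal: the eigenvectors diagonalize R
diagErr  = norm(Dcheck - D);        % ~0

% Flipping the sign of an eigenvector leaves it an eigenvector
% with the same eigenvalue.
u = -V(:, 1);
flipErr = norm(R * u - D(1, 1) * u);   % ~0
```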
The sorting of the eigenvalues is a simple but important step that gets to the heart of PCA. The eigenvector
with the highest eigenvalue points along the direction of highest variance, and therefore has the most
explanatory power for the original dataset. If you could only pick one vector to describe your original
matrix, you'd select the eigenvector with the largest eigenvalue, and this would capture as much of the
information in the original dataset as could possibly be captured in a single vector.
In PCA, the eigenvectors are sometimes referred to as loadings or principal component coefficients.



PCA Step 4
4. Multiply the Standardized Data by the Eigenvectors
The ultimate goal of PCA is to re-express the original dataset, and now we're finally ready to take that step
and generate the actual principal components. We simply multiply our centered and scaled dataset by the
matrix of eigenvectors that we generated in step 3. As a matrix operation, it's as simple as A x V. To generate
a particular datapoint for Principal Component 1, we add together the values of one observation from the
centered/scaled dataset, each weighted by the corresponding entry of the first eigenvector. So, if PCA1 is
the first Principal Component vector and V represents our eigenvector matrix, then the first entry of PCA1
is x1(v1,1) + y1(v2,1) + z1(v3,1).
Principal Component 1 is uncorrelated to Principal Component 2 by construction. Conceptually, we're
re-expressing the original dataset in terms of the eigenvectors. Since the eigenvectors are perpendicular to one
another, the principal components are uncorrelated.
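Step 4 can be sketched as a single matrix product. The data is hypothetical, and the eigenvectors are computed with eig() rather than taken from the slides.

```matlab
% Hypothetical measurements (rows = people; columns = height, neck, armspan).
X  = [60 14 58; 66 15 67; 70 16 68; 74 17 76];
X2 = (X - mean(X)) ./ std(X);             % centered and scaled (R2016b or later)

[V, D] = eig(cov(X2));
[lambda, order] = sort(diag(D), 'descend');
V = V(:, order);                          % eigenvectors sorted by eigenvalue

score = X2 * V;                           % 4x3 matrix of principal components

% First entry of PC 1, written out as the weighted sum from the text.
pc1_1 = X2(1,1)*V(1,1) + X2(1,2)*V(2,1) + X2(1,3)*V(3,1);

% The principal components are uncorrelated: their covariance matrix is
% diagonal, with the eigenvalues on the diagonal.
C = cov(score);
```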

To interpret the transformed data, it's useful to look at its correlation to the original (unscaled and
uncentered) data. In the table below, we can see that the first principal component (PC1) is extremely similar to
Height, but also captures most of the information in Neck and Armspan; as we learned from the
eigenvalues, PC1 captures 89.9% of the total information of the dataset. PC2 seems to relate primarily to
Neck and Armspan, and PC3 is not closely connected to any of the three variables and adds almost no
information. We could discard the second and third principal components and still retain most of the
information of the original dataset.
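A table of correlations between the original variables and the principal components can be produced along these lines. The data is hypothetical, so the resulting numbers will not match the table from the slides.

```matlab
% Hypothetical measurements (rows = people; columns = height, neck, armspan).
X  = [60 14 58; 66 15 67; 70 16 68; 74 17 76];
X2 = (X - mean(X)) ./ std(X);             % centered and scaled (R2016b or later)

[V, D] = eig(cov(X2));
[~, order] = sort(diag(D), 'descend');
score = X2 * V(:, order);                 % principal components, PC1 first

C = corrcoef([X score]);                  % 6x6 correlations of originals and PCs
pcCorr = C(1:3, 4:6);                     % rows: height, neck, armspan; cols: PC1..PC3

% Each row's squared correlations sum to 1, since the three PCs together
% carry all of the information in each original variable.
rowTotals = sum(pcCorr.^2, 2);
```

The signs in pcCorr depend on the arbitrary signs of the eigenvectors, so only the magnitudes are meaningful.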



PCA Interpretation

We began with 3 measurements of 4 people, but we suspect that one or more of the variables may be
redundant, since intuition suggests that a person's height, neck circumference, and armspan are all
positively correlated.

The PCA results tell us that we can capture 89.9% of the dataset's information with just a single
variable, and 99.8% with 2 variables. PC 1 is extremely similar to height, but also captures a
substantial portion of the information in neck and armspan. If all we care about is comparing the
general size of a population, PC 1 may be sufficient. A high PC 1 score means a person is almost
certainly tall, and probably also has a large neck and armspan. PC 2 functions as a differentiating
factor for neck and armspan; it helps distinguish, for example, people who are tall but have an
unusually short armspan.
To summarize, we've taken a dataset with 3 variables and generated a set of eigenvectors that are
perpendicular to one another. Then we used those eigenvectors to generate a new dataset consisting of
three principal components with zero correlation. We have re-expressed the data from Height, Neck,
and Armspan to PC 1, PC 2, and PC 3, where the latter variables are orthogonal. And this is a key
point: by construction, PC 1 contains as much of the variance in the entire dataset as possible, so we have
the option of ignoring PC 3 with hardly any loss of information, and we may even choose to ignore PC 2.


PCA Assumptions and Pitfalls

Linearity: While most real world data is non-linear, a linear assumption still works reasonably well in
most cases. For systems where non-linearity is a key feature, there are extensions such as kernel
PCA that are more appropriate.
High Variance = Important Information: By sorting by the largest eigenvalue and often ignoring the
lower variance principal components, we are assuming that high variance reflects more important
information.
Gaussian distribution: PCA uses mean and variance in its construction, which implicitly assume a
Gaussian distribution.
Throughout this text I've assumed that we'll always center and scale our data before performing
PCA. The scaling is not strictly required. If you don't scale the data, you'll simply be incorporating
the scale of the variables into the PCA analysis. For example, if variables a and b contain nearly
identical data, but a is in minutes and b is in hours, failing to scale the data will result in a
overwhelming b, since a will contain much more variance.
Another consideration is that if the data is not scaled, then the correlation and covariance matrices
of the data will not be the same, and so you will have to choose between calculating eigenvectors on
the covariance matrix or on the correlation matrix. Usually the correlation matrix is preferable since
it is not sensitive to arbitrary units and makes comparing the results of PCA of various datasets more
meaningful. The covariance matrix is sometimes chosen because statistical inference from the
results of PCA on a sample to a broader population is then more effective.
The biggest problem with Principal Component Analysis is interpretation. The principal components
themselves will often have no intrinsic meaning or intuition behind them. In analysis involving many
complex variables, the eigenvectors will be complex and the meaning often obscure. In our
example, PC 1 was very highly correlated to height, but PC 3 could not be intuitively interpreted.
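The scaling pitfall described above (a in minutes, b in hours) can be illustrated with a small sketch. The durations are hypothetical values invented for this example.

```matlab
% Hypothetical data: roughly the same durations recorded in two units.
a = [30; 45; 60; 90; 120];          % minutes
b = [0.5; 0.8; 1.0; 1.4; 2.1];      % hours
X = [a b];

% Unscaled: eigenvectors of the covariance matrix.
[Vc, Dc] = eig(cov(X));
[~, k] = max(diag(Dc));
pc1_unscaled = Vc(:, k);            % almost entirely along a (first entry near +/-1)

% Scaled: eigenvectors of the correlation matrix.
[Vr, Dr] = eig(corrcoef(X));
[~, k] = max(diag(Dr));
pc1_scaled = Vr(:, k);              % equal weight on a and b (magnitude 1/sqrt(2) each)
```

Without scaling, the first principal component is essentially just a, because its variance in minutes is thousands of times larger than the variance of b in hours.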


Appendix #1 - Algebraic Support for Eigenvectors

The goal of PCA is to replace the original dataset X with a new dataset Y that contains uncorrelated
variables, i.e. the covariance matrix of Y, $S_Y$, must be diagonal. In this appendix X is arranged with
one row per variable and one column per observation, and each row has mean zero. First, note that the
covariance matrix of Y can be calculated as $\frac{1}{n-1} Y Y^T$. In PCA, we're looking to find some
orthonormal matrix P where $Y = PX$ such that the covariance matrix of Y is diagonalized. The rows of P
are the principal components of X.
We begin by rewriting $S_Y$ in terms of our variable of choice, P.
$S_Y = \frac{1}{n-1} Y Y^T$
$= \frac{1}{n-1} (PX)(PX)^T$
$= \frac{1}{n-1} P X X^T P^T$
$= \frac{1}{n-1} P (X X^T) P^T$
$S_Y = \frac{1}{n-1} P A P^T$
Here we have defined a new matrix $A = X X^T$. A is symmetric, so it is diagonalized by an orthogonal
matrix of its eigenvectors: $A = E D E^T$, where D is a diagonal matrix and E is a matrix whose columns
are the eigenvectors of A. Now we define P as $E^T$ and find that $A = P^T D P$.
When we then continue evaluating $S_Y$, we find that this choice of P diagonalizes $S_Y$. Because P is
orthogonal, $P P^T = I$.
$S_Y = \frac{1}{n-1} P A P^T$
$= \frac{1}{n-1} P (P^T D P) P^T$
$= \frac{1}{n-1} (P P^T) D (P P^T)$
$S_Y = \frac{1}{n-1} D$
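The result can be verified numerically. This sketch follows the convention of this appendix (variables as rows, observations as columns) and uses hypothetical data.

```matlab
% Hypothetical data: 3 variables (rows) measured on 4 observations (columns).
X = [60 66 70 74;
     14 15 16 17;
     58 67 68 76];
X = X - mean(X, 2);          % center each row (R2016b or later)
n = size(X, 2);              % number of observations

A = X * X';                  % symmetric 3x3 matrix
[E, D] = eig(A);             % A = E*D*E', eigenvectors in the columns of E

P  = E';                     % rows of P are the principal components
Y  = P * X;                  % re-expressed data
SY = (Y * Y') / (n - 1);     % covariance matrix of Y

offDiag = SY - diag(diag(SY));   % off-diagonal part, approximately zero
```

Up to rounding error, SY equals D / (n - 1), as the derivation predicts.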

This section comes directly from Jon Shlens, Tutorial on Principal Component Analysis (2003), an
excellent resource.


Appendix #2 - Matlab Code

Start with dataset X with variables as columns and observations as rows. Center and scale the
data with the zscore function:
X2 = zscore(X)
The most direct way to perform PCA is with Matlab's aptly named pca() function.
[coeff, score, latent] = pca(X2)
This will output the eigenvectors of the covariance matrix of X2 as the columns of coeff, sorted by
descending eigenvalue, and the principal components as score. The eigenvalues will be returned as latent.
You can break the steps of pca() down by using either the eig() or svd() functions. eig() operates
directly on the covariance matrix, not the original data.
[V, D] = eig(cov(X2))
This will output the eigenvectors as the columns of V and the eigenvalues on the diagonal of D. The
eigenvectors are not sorted in the descending order used by pca(), and their signs may differ from the
pca() output.
[U, S, V] = svd(X2)
will also output the eigenvectors as the columns of V within a singular value decomposition framework.
The eigenvalues are the squared diagonal entries of S divided by n-1, where n is the number of
observations.
You can manually generate pca()'s score output by simply multiplying the centered and scaled
data by the eigenvectors.
score = X2 * coeff
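The relationship between the eig() and svd() routes can be checked with base Matlab functions only. The data is hypothetical, and the centering and scaling is written out by hand instead of calling zscore().

```matlab
% Hypothetical measurements (rows = people; columns = height, neck, armspan).
X  = [60 14 58; 66 15 67; 70 16 68; 74 17 76];
X2 = (X - mean(X)) ./ std(X);             % same result as zscore(X); R2016b or later
n  = size(X2, 1);

% Route 1: eigenvalues of the covariance matrix, sorted largest first.
lambda_eig = sort(eig(cov(X2)), 'descend');

% Route 2: singular value decomposition of the data itself.
[U, S, V] = svd(X2, 'econ');              % singular values come out in descending order
lambda_svd = diag(S).^2 / (n - 1);        % squared singular values / (n-1)

% The principal components can be formed either way.
score_a = X2 * V;
score_b = U * S;
```

Both routes give the same eigenvalues, and X2 * V equals U * S, up to rounding error.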

Tutorial on Principal Component Analysis (2003) by Jon Shlens sections on linear algebra.