Pcatutorial PDF

PRINCIPAL COMPONENTS
ANALYSIS (PCA)
Steven M. Ho!and
Department of Geology, University of Georgia, Athens, GA 30602-2501
May 2008
Introduction
Suppose we had measured two variables, length and width, and plotted them as shown below.
Both variables have approximately the same variance and they are highly correlated with one
another. We could pass a vector through the long axis of the cloud of points and a second
vector at right angles to the first, with both vectors passing through the centroid of the data.
Once we have made these vectors, we could find the coordinates of all of the data points relative to these two perpendicular vectors and replot the data, as shown here (both of these figures are from Swan and Sandilands, 1995).
In this new reference frame, note that variance is greater along axis 1 than it is on axis 2. Also
note that the spatial relationships of the points are unchanged; this process has merely rotated the data. Finally, note that our new vectors, or axes, are uncorrelated. By performing
such a rotation, the new axes might have particular explanations. In this case, axis 1 could be
regarded as a size measure, with samples on the left having both small length and width and
samples on the right having large length and width. Axis 2 could be regarded as a measure of
shape, with samples at any axis 1 position (that is, of a given size) having dierent length to
width ratios. PC axes will generally not coincide exactly with any of the original variables.
Although these relationships may seem obvious, when one is dealing with many variables, this
process allows one to assess much more quickly any relationships among variables. For data
Principal Components Analysis
sets with many variables, the variance of some axes may be great, whereas others may be
small, such that they can be ignored. This is known as reducing the dimensionality of a data
set, such that one might start with thirty original variables, but might end with only two or
three meaningful axes. The formal name for this approach of rotating data such that each
successive axis displays a decreasing among of variance is known as Principal Components
Analysis, or PCA. PCA produces linear combinations of the original variables to generate the
axes, also known as principal components, or PCs.
Computation
Given a data matrix with p variables and n samples, the data are first centered on the means
of each variable. This will insure that the cloud of data is centered on the origin of our principal components, but does not aect the spatial relationships of the data nor the variances
along our variables. The first principal components (Y1) is given by the linear combination of
the variables X1, X2, ...,Xp
Y1 = a11 X1 + a12 X2 + ... + a1p Xp

or, in matrix notation
Y1 = aT1 X
The first principal component is calculated such that it accounts for the greatest possible
variance in the data set. Of course, one could make the variance of Y1 as large as possible by
choosing large values for the weights a11, a12, ... a1p. To prevent this, weights are calculated
with the constraint that their sum of squares is 1.
a211 + a212 + ... + a21p = 1

The second principal component is calculated in the same way, with the condition that it is
uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts
for the next highest variance.
Y2 = a21 X1 + a22 X2 + ... + a2p Xp

This continues until a total of p principal components have been calculated, equal to the
original number of variables. At this point, the sum of the variances of all of the principal
components will equal the sum of the variances of all of the variables, that is, all of the original information has been explained or accounted for. Collectively, all of these transformations of the original variables to the principal components is
Y = AX
Calculating these transformations or weights requires a computer for all all but the smallest
matrices. The rows of matrix A are called the eigenvectors of matrix Sx, the variancecovariance matrix of the original data. The elements of an eigenvector are the weights aij, and
are also known as loadings. The elements in the diagonal of matrix Sy, the variancecovariance matrix of the principal components, are known as the eigenvalues. Eigenvalues
are the variance explained by each principal component, and to repeat, are constrained to
decrease monotonically from the first principal component to the last. These eigenvalues are
commonly plotted on a scree plot to show the decreasing rate at which variance is explained
by additional principal components.
log(variance)
10
15
20
principle component
The positions of each observation in this new coordinate system of principal components are
called scores and are calculated as linear combinations of the original variables and the
weights aij. For example, the score for the rth sample on the kth principal component is calculated as
Ykr = ak1 xk1 + ak2 xk2 + ... + akp xkp

In interpreting the principal components, it is often useful to know the correlations of the
original variables with the principal components. The correlation of variable Xi and principal
component Yj is
rij =
!
a2ij V ar (Yj ) /sii
Because reduction of dimensionality, that is, focussing on a few principal components versus
many variables, is a goal of principal components analysis, several criteria have been proposed
for determining how many PCs should be investigated and how many should be ignored. One
common criteria is to ignore principal components at the point at which the next PC oers
little increase in the total variance explained. A second criteria is to include all those PCs up
to a predetermined total percent variance explained, such as 90%. A third standard is to ignore components whose variance explained is less than 1 when a correlation matrix is used or
less than the average variance explained when a covariance matrix is used, with the idea being
that such a PC oers less than one variables worth of information. A fourth standard is to
ignore the last PCs whose variance explained is all roughly equal.
Principal components are equivalent to major axis regressions. As such, principal components analysis is subject to the same restrictions as regression, in particular multivariate normality. The distributions of each variable should be checked for normality and transforms
used where necessary to correct high degrees of skewness in particular. Outliers should be
removed from the data set as they can dominate the results of a principal components analysis.
PCA in R
1) Do an R-mode PCA using prcomp() in R. To do a Q-mode PCA, the data set should be
transposed before proceeding. R-mode PCA examines the correlations or covariances among
variables, whereas Q-mode focusses on the correlations or covariances among samples.
> mydata <- read.table(file="mydata.txt", header=TRUE,
row.names=1, sep=",")
> mydata.pca <- prcomp(mydata, retx=TRUE, center=TRUE,
scale.=TRUE)
# variable means set to zero, and variances set to one
# sample scores stored in mydata.pca$x
# loadings stored in mydata.pca$rotation
# singular values (square roots of eigenvalues) stored
# in mydata.pca$sdev
# variable means stored in mydata.pca$center
# variable standard deviations stored in mydata.pca$scale
>
>
>
>
sd <- mydata.pca$sdev
loadings <- mydata.pca$rotation
rownames(loadings) <- colnames(mydata)
scores <- mydata.pca$x
2) Do a PCA longhand in R:
> R <- cor(mydata)
# calculate a correlation matrix
>
#
#
#
myEig <- eigen(R)

find the eigenvalues and eigenvectors of correlation matrix
eigenvalues stored in myEig$values
eigenvectors (loadings) stored in myEig$vectors
> sdLONG <- sqrt(myEig$values)

# calculating singular values from eigenvalues
> loadingsLONG <- myEig$vectors
> rownames(loadingsLONG) <- colnames(mydata)

# saving as loadings, and setting rownames
> standardize <- function(x) {(x - mean(x))/sd(x)}
> X <- apply(mydata, MARGIN=2, FUN=standardize)
# transforming data to zero mean and unit variance
> scoresLONG <- X %*% loadingsLONG
# calculating scores from eigenanalysis results
3) Compare results from the two analyses to demonstrate equivalency. Maximum dierences
should be close to zero if the two approaches are equivalent.
> range(sd - sdLONG)
> range(loadings - loadingsLONG)
> range(scores - scoresLONG)
4) Do a distance biplot (see Legendre & Legendre, 1998, p. 403)
> quartz(height=7, width=7)
> plot(scores[,1], scores[,2], xlab="PCA 1", ylab="PCA 2",
type="n", xlim=c(min(scores[,1:2]), max(scores[,1:2])),
ylim=c(min(scores[,1:2]), max(scores[,1:2])))
> arrows(0,0,loadings[,1]*10,loadings[,2]*10, length=0.1,
angle=20, col="red")
# note that this scaling factor of 10 may need to be changed,
# depending on the data set
> text(loadings[,1]*10*1.2,loadings[,2]*10*1.2,
rownames(loadings), col="red", cex=0.7)
# 1.2 scaling insures that labels are plotted just beyond
# the arrows
> text(scores[,1],scores[,2], rownames(scores), col="blue",
cex=0.7)
15
5
PCA 2
10
CJ001
CJ011
Na
B
K
Li
W
Sr As
Ca
Ba Si
Th Cd
AgTi
CJ003
CJ005
CJ045
CJ043Au
Be
CJ025
CJ041
d18O
CJ023
DH
CJ047
Cu
Sb
JK013
pH
Zn
Hg
CJ017
CJ009
Se
DC002
Tl
JK012
JK014
Mn
Co
CJ051
CJ053
Mo
Ni
Cr
Pb
Mg
Y
CJ063
U Ce
CJ027
La
CJ049
CJ007
P
V
Fe
Al
SO4
DC008 DC011
DC012
CJ015 CJ019
CJ029
CJ031
CJ039
CJ037
CJ035
CJ021
DC006
10
15
PCA 1

> biplot(scores[,1:2], loadings[,1:2], xlab=rownames(scores),
ylab=rownames(loadings), cex=0.7)
# using built-in biplot function
0.2
0.4
0.6
As
Sr
Th
Si
0.2
PC2
10
CJ001
Na
BCJ011
K
Li
W
Ba
Ca
Cd
Ag Ti
Au
Be
CJ003
CJ045
d18O CJ005
CJ043
CJ025
CJ041
CJ023
DH Cu
CJ047
JK013
CJ017CJ009
Zn Sb
pH CJ051
Hg
DC002
JK012
JK014
CJ053
Se
Tl
CJ063
CJ027
CJ049
CJ007 Mn
Co
Mo
Ni
Cr
Pb
DC008DC011
YMg
DC012
U Ce
CJ015CJ019
CJ029
La
CJ021
CJ031
CJ039
CJ037
VP
CJ035
Fe
Al
SO4
10
0.0
0.4
15
0.6
0.0
DC006
15
PC1
5) Do a correlation biplot (see Legendre & Legendre, 1998, p. 404)

> plot(scores[,1]/sd[1], scores[,2]/sd[2], xlab="PCA 1",
ylab="PCA 2", type="n")
> arrows(0,0,loadings[,1]*sd[1],loadings[,2]*sd[2],
length=0.1, angle=20, col="red")

> text(loadings[,1]*sd[1]*1.2,loadings[,2]*sd[2]*1.2,
rownames(loadings), col="red", cex=0.7)
> text(scores[,1]/sd[1],scores[,2]/sd[2], rownames(scores),
col="blue", cex=0.7)
# 1.2 scaling insures that labels are plotted just beyond
# the arrows
CJ001
PCA 2
CJ011
Na
B
K
Li
W
As
Sr
Ca
Ba
Th
CJ003
Si
Cd
Ag Ti
CJ005
CJ045
CJ043
CJ025
Au
CJ041
Be
CJ023
d18O
DH
JK013Hg
Cu
SbSe
pH CJ047
CJ017
CJ009
Zn
DC002
Tl
JK012
JK014
Mn
Co
Mo
Ni
CJ051
Cr
CJ053
Pb
Mg
Y
U
Ce
La
CJ063
VAlP
CJ027
CJ049
CJ007
Fe
SO4
CJ019
DC008 DC011
DC012
CJ015
CJ029
CJ031
CJ039
CJ037
CJ035
CJ021
DC006
PCA 1
>
>
#
#
#
#
quartz(height=7, width=7)
biplot(mydata.pca)
using built-in biplot function of prcomp(); note that only
the top and right axes match the coordinates of the points;
also note that still doesn't quite replicate the correlation
biplot. Its unclear what this function really does.
10
15
0.6
CJ001
0.2
0.4
10
0.0
Na
B
K
Li
W
Sr As
Ba Si
ThCa
CJ003
TiCJ005
AgCd
CJ045
CJ043
CJ025
Au
CJ041
Be
CJ023
d18O
CJ047
DH
Cu
Sb
pHJK013
Hg
CJ017
CJ009
Zn
Se
DC002
Tl
JK012
JK014
Mn
Co
Mo
CJ051
Ni
Cr
CJ053
Mg
Pb
Y
U SO4
Ce
La
CJ063
CJ027
P
V
CJ049
CJ007
Fe
Al
CJ019
DC008
DC012CJ021
DC011
CJ015
CJ029
CJ031
CJ039
CJ037
CJ035
DC006
0.2
0.0
PC2
0.4
15
CJ011
0.6
PC1
6) Calculate the correlation coecients between variables and principal components

> correlations <- t(loadings)*sd
# find the correlation of all the variables with all PC's
> correlations <- cor(scores,mydata)
# another way to find these correlations
7) Plot a scree graph
> plot(mydata.pca)
# using built-in function for prcomp; may not show all PCs
10
0
Variances
15
20
mydata.pca

> plot(log(sd^2), xlab="principle component",
ylab="log(variance)", type="b", pch=16)
# using a general plot, with variance on a log scale
20
40
60
log(variance)
10
15
20
25
30
35
principle component
8) Find variance along each principal component and the eigenvalues

> newsd <- sd(scores)
> max (sd - newsd)
# finds maximum difference between the standard deviation form
# prcomp and the standard deviation calculated longhand;

# should be close to zero
> eigenvalues <- sd^2
> sum(eigenvalues)
# should equal number of variables
> length(mydata)
# number of variables
9) Save loadings, scores, and singular values to files
> write.table(loadings, file="loadings.txt")
> write.table(scores, file="scores.txt")
> write.table(sd, file="sd.txt")
References
Legendre, P., and L. Legendre, 1998. Numerical Ecology. Elsevier: Amsterdam, 853 p.
Swan, A.R.H., and M. Sandilands, 1995. Introduction to Geological Data Analysis. Blackwell
Science: Oxford, 446 p.
10

Pcatutorial PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pcatutorial PDF

Uploaded by

Copyright:

Available Formats

PRINCIPAL COMPONENTS

Principal Components Analysis

Y1 = a11 X1 + a12 X2 + ... + a1p Xp

a211 + a212 + ... + a21p = 1

Y2 = a21 X1 + a22 X2 + ... + a2p Xp

Principal Components Analysis

Ykr = ak1 xk1 + ak2 xk2 + ... + akp xkp

Principal Components Analysis

myEig <- eigen(R)

> sdLONG <- sqrt(myEig$values)

Principal Components Analysis

> rownames(loadingsLONG) <- colnames(mydata)

Principal Components Analysis

> quartz(height=7, width=7)

5) Do a correlation biplot (see Legendre & Legendre, 1998, p. 404)

Principal Components Analysis

length=0.1, angle=20, col="red")

Principal Components Analysis

6) Calculate the correlation coecients between variables and principal components

Principal Components Analysis

> quartz(height=7, width=7)

8) Find variance along each principal component and the eigenvalues

Principal Components Analysis

# prcomp and the standard deviation calculated longhand;

Principal Components Analysis

You might also like