
Principal Components Analysis

Outline

Data reduction
PCA vs. FA
Assumptions and other issues
Multivariate analysis in terms of eigenanalysis
PCA basics
Examples

Ockham's Razor
One of the hallmarks of good science is parsimony and elegance of theory
However, in analysis it is also often desirable to reduce the data in some fashion in order to get a better understanding of it, or simply for ease of analysis
In multiple regression this was done more implicitly
Reduce the predictors to a single composite
The sum of the weighted variables

Note the correlation (Multiple R) and its square

Principal Components Analysis


Conceptually, the goal of PCA is to reduce the number of variables of interest into a smaller set of components
PCA analyzes all the variance in the variables and reorganizes it into a new set of components equal to the number of original variables
Regarding the new components:
They are independent
They decrease in the amount of the original variance they account for
x The first component captures the most variance, the second the next most, and so on until all the variance is accounted for
Only some will be retained for further study (dimension reduction)
x Since the first few capture most of the variance, they are typically the focus

PCA vs. Factor Analysis


It is easy to make the mistake of assuming that these are the same technique; in some ways exploratory factor analysis and PCA are similar, and in general both can be seen as factor analytic techniques
However, they are typically used for different reasons, are not mechanically the same, nor do they have the same underlying linear model

PCA/FA
Principal Components Analysis
Extracts all the factors underlying a set of
variables
The number of factors = the number of variables
Completely explains the variance in each variable

Factor Analysis
Analyzes only the shared variance
x Error is estimated apart from shared variance

FA vs. PCA conceptually


FA produces factors; PCA produces components
Factors cause variables; components are aggregates of the variables
The underlying causal model is fundamentally distinct between the two
Some do not consider PCA as part of the FA family

Contrasting the underlying models
[Path diagrams: in FA, the latent factor points to the indicators I1, I2, I3 (the factor causes the variables); in PCA, the indicators I1, I2, I3 point to the component (the component is an aggregate of the variables)]


PCA

Extraction is the process of forming PCs as linear combinations of the measured variables, as we have done with our other techniques
x PC1 = b11X1 + b21X2 + … + bk1Xk
x PC2 = b12X1 + b22X2 + … + bk2Xk
x PCf = b1fX1 + b2fX2 + … + bkfXk

Common factor model

Each measure X has two contributing sources of variation: the common factor ξ and the specific or unique factor δ:
x X1 = λ1ξ + δ1
x X2 = λ2ξ + δ2
x Xf = λfξ + δf

FA vs. PCA
PCA
PCA is mathematically precise in orthogonalizing
dimensions
PCA redistributes all variance into orthogonal components
PCA uses all variable variance and treats it as true variance

FA
FA distributes common variance into orthogonal factors
FA is conceptually realistic in identifying common factors
FA recognizes measurement error and true factor variance

FA vs. PCA
In some sense, PCA and FA are not so different conceptually from what we have been doing since multiple regression
Creating linear combinations
PCA especially falls more along the lines of what we've already been doing
What is different from previous methods is that there is no IV/DV distinction
Just a single set of variables

FA vs. PCA Summary


PCA's goal is to analyze variance and reduce the observed variables
PCA reproduces the R matrix perfectly
In PCA the goal is to extract as much variance as possible with the fewest components
PCA gives a unique solution

FA analyzes covariance (communality)
FA is a close approximation to the R matrix
In FA the goal is to explain as much of the covariance as possible with a minimum number of factors that are tied specifically to assumed constructs
FA can give multiple solutions depending on the method and the estimates of communality

Questions Regarding PCA


Which components account for the most
variance?
How well does the component structure fit a
given theory?
What would each subjects score be if they could
be measured directly on the components?
What is the percentage of variance in the data
accounted for by the components?

Assumptions/Issues
Assumes reliable variables/correlations
Very much affected by missing data, outlying cases and
truncated data
Data screening methods (e.g. transformations, etc.) may
improve poor factor analytic results

Normality
Univariate: normally distributed variables make the solution stronger, but normality is not necessary if we are using the analysis in a purely descriptive manner
Multivariate normality is assumed when statistically assessing the number of factors

Assumptions/Issues
No outliers
Influence on correlations would bias results

Variables as outliers
Some variables don't work
Explain very little variance
Relate poorly to the primary components
Low squared multiple correlation when treated as the DV with the other items as predictors
Low loadings

Assumptions/Issues
Factorable R matrix
Need inter-item/variable correlations > .30, or PCA/FA isn't going to do much for you
Large inter-item correlations do not guarantee a solution either
x While two variables may be highly correlated, they may not be correlated with the others
Kaiser's measure of sampling adequacy, based on the matrix of partial correlations adjusted for the other variables, can help assess this (a sketch follows below)
x Kaiser's measure is the ratio of the sum of squared correlations to the sum of squared correlations plus the sum of squared partial correlations
x It approaches 1 if the partials are small; typically we desire values of about .6+
Multicollinearity/Singularity
In traditional PCA it is not a problem; no matrix inversion is necessary
x As such it is one solution for dealing with collinearity in regression
Investigate tolerances, det(R)
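Below is a minimal R sketch of these factorability checks, assuming a hypothetical data frame X holding the variables to be analyzed; KMO() comes from the psych package.

# Factorability checks for a hypothetical data frame X of numeric variables
library(psych)

R <- cor(X, use = "pairwise.complete.obs")  # inter-item correlation matrix

range(abs(R[lower.tri(R)]))  # are correlations generally above .30?
det(R)                       # a near-zero determinant flags multicollinearity/singularity
KMO(R)                       # Kaiser's measure of sampling adequacy (want roughly .6+)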

Assumptions/Issues
Sample Size and Missing Data
True missing data are handled in the usual ways
Factor analysis via maximum likelihood needs large samples, which is one of its few drawbacks

The more reliable the correlations are, the smaller the number of
subjects needed
Need enough subjects for stable estimates

How many?
Depends on the nature of the data and the number of parameters to be
estimated
x For example, a simple setting with few variables and clean data might not need as many cases
x Several hundred data points for a more complex solution, with messy data and lower correlations among the variables, might not provide a meaningful result (PCA) or even converge on a solution (FA)

Other issues
No readily defined criteria by which to judge the outcome
Before we had R² for example
Choice of rotation is dependent entirely on the researcher's estimation of interpretability
Often used when other outcomes/analyses are not so hot, just to have something to talk about

Extraction Methods
PCA
Extracts maximum variance with each component
First component is a linear combination of variables
that maximizes component score variance for the cases
The second (etc.) extracts the max. variance from the
residual matrix left over after extracting the first
component (therefore orthogonal to the first)
If all components retained, all variance explained

PCA
Components are linear combinations of variables
These combinations are based on weights (eigenvectors) developed by the analysis
As we will see later, PCA is not much different from canonical correlation in terms of generating canonical variates from linear combinations of variables
Although in PCA there are no 'sides' of the equation, and you're not necessarily correlating the factors, components, variates, etc.
The loading for each item/variable is its correlation with the component (i.e., the underlying shared variance)
However, unlike many of the analyses you have been exposed to, there is no statistical criterion to compare the linear combination to
In MANOVA we create linear combinations that maximally differentiate groups
In canonical correlation one linear combination is used to maximally correlate with another
PCA is a form of unsupervised learning

PCA
With multivariate research we come to eigenvalues and eigenvectors
Eigenvalues
Conceptually, an eigenvalue can be considered to measure the strength (relative length) of an axis in N-dimensional space
Derived via eigenanalysis of the square, symmetric matrix
x The covariance or correlation matrix
Eigenvector
Each eigenvalue has an associated eigenvector. While the eigenvalue gives the length of an axis, the eigenvector determines its orientation in space. The values in an eigenvector are not unique, because any set of coordinates that described the same orientation would be acceptable.
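To make this concrete, a small sketch using base R's eigen() on a correlation matrix (X again a hypothetical data matrix):

# Eigenanalysis of the square, symmetric correlation matrix
R <- cor(X)
e <- eigen(R)

e$values        # eigenvalues: the variance (relative length) of each new axis
e$vectors       # eigenvectors: the orientation of each axis (columns, unit length)
sum(e$values)   # equals the number of variables when R is analyzed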

Data
Example data of women's height and weight (raw and standardized)

height  weight  Zheight  Zweight
57      93      -1.774   -1.965
58      110     -1.471   -0.873
60      99      -0.864   -1.580
59      111     -1.168   -0.809
61      115     -0.561   -0.552
60      122     -0.864   -0.103
62      110     -0.258   -0.873
61      116     -0.561   -0.488
62      122     -0.258   -0.103
63      128      0.045    0.283
62      134     -0.258    0.668
64      117      0.349   -0.424
63      123      0.045   -0.039
65      129      0.652    0.347
64      135      0.349    0.732
66      128      0.955    0.283
67      135      1.259    0.732
66      148      0.955    1.567
68      142      1.562    1.182
69      155      1.865    2.017

Data transformation
Consider two variables, height and weight
X would be our data matrix, w our eigenvector (coefficients)
Multiplying our original data by these weights results in a column vector of values
z1 = Xw
Multiplying a matrix by a vector results in a linear combination
The variance of this linear combination is the eigenvalue

Data transformation
Consider a woman 5 feet tall (60 inches) and 122 pounds
She is -.86 sd from the mean height and -.10 sd from the mean weight for these data

a'b = (a1 a2)(b1 b2)' = a1b1 + a2b2

The first eigenvector associated with the normalized data is [.707, .707], so the resulting value for that data point is -.68
So with the top graph we have taken the original data point and projected it onto a new axis, -.68 units from the origin
Now if we do this for all data points we will have projected them onto a new axis/component/dimension/factor/linear combination
The length of the new axis is the eigenvalue
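A hedged R sketch of this projection, using the height/weight data above; the sign of an eigenvector is arbitrary, so the projection may come out as +.68 rather than -.68.

height <- c(57, 58, 60, 59, 61, 60, 62, 61, 62, 63, 62, 64, 63, 65, 64, 66, 67, 66, 68, 69)
weight <- c(93, 110, 99, 111, 115, 122, 110, 116, 122, 128, 134, 117, 123, 129, 135, 128, 135, 148, 142, 155)

Z  <- scale(cbind(height, weight))   # standardize both variables
w1 <- eigen(cor(Z))$vectors[, 1]     # first eigenvector, approximately (.707, .707)

z1 <- Z %*% w1   # project every case onto the new axis
z1[6]            # the 60-inch, 122-pound woman: about -.68 (up to sign)
var(z1)          # variance of the projections = the first eigenvalue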

Data transformation
Suppose we have more than one dimension/factor?
In our discussion of the techniques thus far, we have said that each component or dimension is independent of the previous one
What does independent mean?

r = 0

What does this mean geometrically in the multivariate sense?
It means that the next axis specified is perpendicular to the previous one
Note how r is represented even here
The cosine of the 90° angle formed by the two axes is 0
Had the lines been on top of each other (i.e., perfectly correlated), the angle formed by them would be zero, whose cosine is 1

r=1

Data transformation
The other eigenvector associated with the data is (-.707, .707)
Doing as we did before, we'd create that second axis, and then could plot the data points along these new axes
We now have two linear combinations, each of which is interpretable as the vector of projections of the original data points onto a directed line segment
Note how the basic shape of the original data has been perfectly maintained
The effect has been to rotate the configuration (45°) to a new orientation while preserving its essential size and shape
It is an orthogonal transformation
Note that we have been talking of specifying/rotating axes, but rotating the points themselves would give us the same result

Meaning of Principal Components


Component analyses are those that are based on the full correlation matrix
1.00s in the diagonal
Principal analyses are those for which each successive factor...
accounts for maximum available variance
is orthogonal (uncorrelated, independent) to all prior factors
gives a full solution (as many factors as variables), i.e. accounts for all the variance

Application of PC analysis
Components analysis is a kind of data reduction
start with an inter-related set of measured variables
identify a smaller set of composite variables that can be constructed from the measured variables and that carry as much of their information as possible

A full components solution...
has as many components as variables
accounts for 100% of the variables' variance
each variable has a final communality of 1.00

A truncated components solution...
has fewer components than variables
accounts for <100% of the variables' variance
each variable has a communality < 1.00

The steps of a PC analysis


Compute the correlation matrix
Extract a full components solution
Determine the number of components to keep
total variance accounted for
variable communalities
interpretability
replicability
Rotate the components and interpret (name) them
Compute component scores
Apply the components solution
theoretically -- understand the meaning of the data reduction
statistically -- use the component scores in other analyses

PC Extraction
Extraction is the process of forming PCs as linear combinations of the measured variables, as we have done with our other techniques
x PC1 = b11X1 + b21X2 + … + bk1Xk
x PC2 = b12X1 + b22X2 + … + bk2Xk
x PCf = b1fX1 + b2fX2 + … + bkfXk
The goal is to reproduce as much of the information in the measured variables with as few PCs as possible
Here's the thing to remember
We usually perform factor analyses to find out how many groups of related variables there are, however
The mathematical goal of extraction is to reproduce the variables' variance, efficiently

3 variable example
Consider 3 variables with the correlations displayed
In a 3D sense we might envision their relationship as shown, with the shadows being roughly what the scatterplots would look like for each bivariate relationship
[3D scatterplot of X1, X2, X3]

The first component identified

The variance of this component, its eigenvalue, is 2.063
In other words it accounts for about twice as much variance as any single variable
Note: with 3 variables, 2.063/3 = .688, i.e. 68.8% of the variance is accounted for by this first component

PCA
In principal components, we extract as many components as there are variables
As mentioned previously, each component by default is uncorrelated with the previous ones
If we save the component scores and were to look at their graph, it would resemble something like this

How do we interpret the components?


The component loadings can inform us as to their interpretation
They are the original variables' correlations with the component
In this case, all load nicely on the first component, which, since the others do not account for nearly as much variance, is probably the only one to interpret
Depending on the type of PCA, the rotation, etc., you may see different loadings, although often the general pattern will remain
With PCA it is as much the overall pattern that should be considered as the sign or absolute values of particular loadings
Which variables load onto which components in general?

Here is an example of magazine readership from the chapter handout
Underlined loadings are > .30
How might this be interpreted?

Applied example
Six items
Three sadness, three relationship quality
N = 300

PCA

Start with the Correlation Matrix

Communalities are Estimated


A measure of how much variance of the original variables is accounted for by the observed components/factors
Uniqueness is 1 - communality
With PCA retaining all components (as opposed to a truncated solution), communality will always equal 1
Why 1.0?
PCA analyzes all the variance for each variable

As we'll see with FA, the approach will be different
The initial value is the multiple R² for the association between an item and all the other items in the model
FA analyzes shared variance

What are we looking for?


Any factor whose eigenvalue is less than 1.0 is in most cases not going to be retained for interpretation
Unless it is very close or has a readily understood and interesting meaning

Loadings that are:
more than .5 are typically considered strong
between .3 and .5 are acceptable
less than .3 are typically considered weak

Matrix reproduction

All the information in the correlation matrix is maintained
Correlations can be reproduced exactly in PCA
x Sum of the cross-products of the loadings

Assessing the variance accounted for


Eigenvalue / N of items or variables
The eigenvalue is an index of the strength of the component, the amount of variance it accounts for
It is also the sum of the squared loadings for that component

Loadings

Eigenvalue of factor 1 = .609² + .614² + .593² + .728² + .767² + .764² = 2.80
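A quick R check of this arithmetic, using the loadings shown on the slide:

loadings_f1 <- c(.609, .614, .593, .728, .767, .764)
sum(loadings_f1^2)   # approximately 2.80, the eigenvalue of the first component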

Reproducing the correlation matrix (R)


Sum the products of the loadings for two variables across all factors
For RQ1 and RQ2:
(.61 * .61) + (.61 * .57) + (-.12 * -.41) + (-.45 * .33) + (.06 * .05) + (.20 * -.16) = .59
If we kept just the first two factors, the reproduced correlation = .72

Note that an index of the quality of a factor analysis (as opposed to PCA) is the extent to which the factor loadings can reproduce the correlation matrix. With PCA, the correlation matrix is reproduced exactly if all components are retained; when we don't retain them all, we can use a similar approach to assess fit.
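The same idea in R, assuming a hypothetical full loading matrix L (rows = items, columns = components) from the PCA:

L %*% t(L)                  # reproduces the correlation matrix exactly when all components are kept
sum(L[1, ] * L[2, ])        # correlation of items 1 and 2 (e.g., RQ1 and RQ2) from all components
sum(L[1, 1:2] * L[2, 1:2])  # approximate reproduction using only the first two components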

Original correlation

Variance Accounted For


For items
The sum of the squared loadings (i.e., weights) across the components is the amount of variance accounted for in each item
Item 1:
x .61² + .61² + (-.12)² + .45² + .06² + .20²
x = .37 + .37 + .014 + .20 + .004 + .04 ≈ 1.0
For the first two factors: .74

For components
How much variance is accounted for by the components that will be retained?

When is it appropriate to use PCA?


PCA is largely a descriptive procedure
In our examples, we are looking at variables with decent correlations. However, if the variables are largely uncorrelated, PCA won't do much for you
It may just provide components that are representative of each individual variable, i.e. nothing is gained
One may use Bartlett's sphericity test to determine whether such an approach is appropriate
It tests the null hypothesis that the R matrix is an identity matrix (ones on the diagonal, zeros off the diagonal)
When the determinant of R is small (recall from before that this implies strong correlation), the chi-square statistic will be large; we reject H0 and PCA would be appropriate for data reduction
One should note though that it is a powerful test, and will usually result in rejection with typical sample sizes
One may instead refer to an estimate of practical effect rather than a statistical test

Are the correlations worthwhile?

χ² = -[(n - 1) - (2p + 5)/6] ln|R|
df = (p² - p)/2
p = number of variables
n = number of observations
ln|R| = natural log of the determinant of R
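A sketch of this test in R, either by hand from the formula above or via psych's cortest.bartlett(); X is a hypothetical data matrix.

R <- cor(X)
n <- nrow(X)
p <- ncol(X)

chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
df    <- (p^2 - p) / 2
pchisq(chisq, df, lower.tail = FALSE)  # small p-value: reject H0 that R is an identity matrix

psych::cortest.bartlett(R, n = n)      # packaged equivalent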

How should the data be scaled?


In most of our examples we have been using the R
matrix instead of the var-covar matrix
As PCA seeks to maximize variance, it can be sensitive to
scale differences across variables
Variables with a larger range of scores would thus have
more of an impact on the linear combination created
As such, the R matrix will typically be used, except
perhaps in cases where the items are on the same scale
(e.g. Likert)
The values involved will change (e.g. eigenvalues),
though the general interpretation may not

How many components should be retained?


There are many ways to determine this
"Solving the number of factors problem is easy, I do it every day before breakfast. But knowing the right solution is harder" (Kaiser)

Kaiser's Rule
What we've already suggested, i.e. eigenvalues over 1
The idea is that any component retained should account for at least as much variance as a single variable
Another perspective on this is to retain as many components as will account for X% of the variance in the original variables

Practical approach

Scree Plot
Look for the elbow
Look for the point after which the remaining eigenvalues decrease in a linear fashion, and retain only those above the elbow
Not really a good primary approach, though it may be consistent with others
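A scree plot is just the eigenvalues plotted in decreasing order; a base-R sketch (X again hypothetical):

ev <- eigen(cor(X))$values
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)   # Kaiser's rule reference line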

Chi-square

The null hypothesis is that X number of components is sufficient
Want a nonsignificant result

Horn's Procedure
This is a different approach, which suggests creating a set of random data of the same size (N cases and p variables)
The idea is that in maximizing the variance accounted for, PCA has a good chance of capitalizing on chance
Even with random data, the first eigenvalue will be > 1
As such, retain components with eigenvalues greater than those produced by the random data (see the sketch below)
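psych's fa.parallel() implements a version of Horn's procedure; a by-hand sketch of the same idea is shown as well (X hypothetical; comparing to the 95th percentile of the random eigenvalues, rather than their mean, is one common variant).

psych::fa.parallel(X, fa = "pc")   # packaged parallel analysis for components

# By-hand sketch of the same idea
set.seed(123)
n <- nrow(X); p <- ncol(X)
rand_ev <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)))$values)  # random-data eigenvalues
obs_ev  <- eigen(cor(X))$values
obs_ev > apply(rand_ev, 1, quantile, 0.95)  # TRUE for components that beat the random benchmark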

Rotation
Sometimes our loadings will be a little difficult to interpret initially
In such a case we can rotate the solution so that the loadings perhaps make more sense
This is typically done in factor analysis but is possible here too
An orthogonal rotation is just a shift to a new set of coordinate axes in the same space spanned by the principal components

Rotation
You can think of it as shifting the axes, or rotating the 'egg' in our previous graphic
The gist is that the relations among the items are maintained, while maximizing their more natural loadings and minimizing off-loadings
Note that since PCA initially creates independent components, orthogonal rotations that maintain this independence are typically used
Loadings will be either large or small, with little in between
Varimax is the common rotation utilized
It maximizes the variance of the squared loadings within each component
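A sketch of an orthogonal (varimax) rotation in R; principal() is from psych, and base R's varimax() can also be applied directly to a retained loading matrix L (both X and L are illustrative names).

pc2 <- psych::principal(cor(X), nfactors = 2, rotate = "varimax")  # retain and rotate two components
pc2$loadings

varimax(L)   # or rotate an existing (variables x retained components) loading matrix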

Other issues: How do we assess validity?


Usual suspects
Cross-validation
Holdout sample, as we have discussed before
About a 2/3, 1/3 split
Using the eigenvectors from the original components, we can create new components with the new data and see how much variance each accounts for
Hope it's similar to the original solution

Jackknife
With smaller samples, conduct the PCA multiple times, each with a specific case held out
Using the eigenvectors, calculate the component score for the case held out
Compare the eigenvalues for the components involved

Bootstrap
In the absence of a holdout sample, we can create bootstrapped samples to perform the same function (a sketch follows)
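A minimal bootstrap sketch of that last idea, resampling cases and examining how stable the eigenvalues are across resamples (X hypothetical):

set.seed(42)
boot_ev <- replicate(500, {
  Xb <- X[sample(nrow(X), replace = TRUE), ]  # resample cases with replacement
  eigen(cor(Xb))$values
})
apply(boot_ev, 1, quantile, c(.025, .975))    # interval for each eigenvalue across bootstrap samples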

Other issues:
Factoring items vs. factoring scales
Items are often factored as part of the process of scale development
Check if the items go together like the scale's author thinks
Scales (composites of items) are factored to examine the construct validity of new scales
test theory about what constructs are interrelated
Remember, the reason we have scales is that individual items are typically unreliable and have limited validity

Other issues:
Factoring items vs. factoring scales
The limited reliability and validity of items means that they will be measured with less precision, and so their intercorrelations for any one sample will be fraught with error
Since factoring starts with R, factoring of items is likely to yield spurious solutions -- replication of item-level factoring is very important!
Is the issue really items vs. scales?
No -- it is really the reliability and validity of the things being factored, and scales have these properties more than items do

Other issues:
When is it appropriate to use PCA?
Another reason to use PCA, which obviously isn't a great one, is that the maximum likelihood estimation involved in an exploratory factor analysis does not converge
PCA will always give a result (it does not require matrix inversion) and so can often be used in such a situation
We'll talk more on this later, but in data reduction situations EFA is typically to be preferred for social scientists and others who use imprecise measures

Other issues:
Selecting Variables for Analysis
Sometimes a researcher has access to a data set that
someone else has collected -- an opportunistic data set
While this can be a real money/time saver, be sure to
recognize the possible limitations
Be sure the sample represents a population you want to
talk about
Carefully consider variables that aren't included and the possible effects their absence has on the resulting factors
this is especially true if the data set was chosen to be efficient -- variables chosen to cover several domains
You should plan to replicate any results obtained from opportunistic data

Other issues:
Selecting the Sample for Analysis
How many?
Keep in mind that R, and so the factor solution, is the same no matter how many cases are used -- so the point is the representativeness and stability of the correlations
Advice about the subject/variable ratio varies pretty dramatically
5-10 cases per variable
300 cases minimum (maybe + # of items)
Consider that, as for other statistics, your standard error for a correlation decreases with increasing sample size

A note about SPSS


SPSS does provide a means for principal
components analysis
However, its presentation (much like many
textbooks for that matter) blurs the distinction
between PCA and FA, such that they are easily
confused
Although they are both data dimension
reduction techniques, they do go about the
process differently, have different implications
regarding the results and can even come to
different conclusions

A note about SPSS


In SPSS, the menu is factor analysis (even though
principal components is the default technique setting)
Unlike in other programs, PCA isn't even a separate procedure (it's all in the FACTOR syntax)
In order to perform PCA, make sure you have principal components selected as your extraction method, analyze the correlation matrix, and specify that the number of factors to be extracted equals the number of variables
Even then, your loadings will be different from those of other programs, which are scaled such that the sum of their squared values = 1
In general, be cautious when using SPSS

PCA in R

Package name    Function name
base            princomp
psych           principal, VSS
pcaMethods      pca (with Q2 for cross-validation). As the name implies, this package is all about PCA, from a modern approach. It will automatically estimate missing values (via traditional, robust, or Bayesian methods) and is useful just for that for any analysis.
FactoMineR      PCA (an R-commander plugin is available)
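Hedged sketches of what those calls might look like for a data frame X; the pcaMethods arguments in particular are from memory and should be checked against that package's documentation.

fit1 <- princomp(X, cor = TRUE)   # base R, PCA on the correlation matrix
summary(fit1)
loadings(fit1)

fit2 <- psych::principal(X, nfactors = ncol(X), rotate = "none")   # full, unrotated solution

# pcaMethods (Bioconductor), e.g. probabilistic PCA with missing-value handling:
# fit3 <- pcaMethods::pca(X, method = "ppca", nPcs = 2)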
