
Principal Components Analysis

Outline

Data reduction
PCA vs. FA
Assumptions and other issues
Multivariate analysis in terms of eigenanalysis
PCA basics
Examples

Ockham's Razor
One of the hallmarks of good science is parsimony and elegance of theory
However, in analysis it is also often desirable to reduce the data in some fashion in order to get a better understanding of it, or simply for ease of analysis
In multiple regression this was done more implicitly
Reduce the predictors to a single composite
The sum of the weighted variables

Note the correlation (Multiple R) and its square

Principal Components Analysis


Conceptually, the goal of PCA is to reduce the number of variables of interest into a smaller set of components
PCA analyzes all the variance in the variables and reorganizes it into a new set of components equal to the number of original variables
Regarding the new components:
They are independent
They decrease in the amount of the original variance they account for
x The first component captures the most variance, the second the next most, and so on until all the variance is accounted for
Only some will be retained for further study (dimension reduction)
x Since the first few capture most of the variance, they are typically the focus

PCA vs. Factor Analysis


It is easy to make the mistake of assuming that these are the same technique; in some ways exploratory factor analysis and PCA are similar, and in general both can be seen as factor analytic techniques
However, they are typically used for different reasons, are not mechanically the same, nor do they have the same underlying linear model

PCA/FA
Principal Components Analysis
Extracts all the factors underlying a set of
variables
The number of factors = the number of variables
Completely explains the variance in each variable

Factor Analysis
Analyzes only the shared variance
x Error is estimated apart from shared variance

FA vs. PCA conceptually


FA produces factors; PCA produces components
Factors cause variables; components are aggregates of the variables
The underlying causal model is fundamentally distinct between the two
Some do not consider PCA as part of the FA family

Contrasting the underlying models
[Path diagrams: in FA, the latent factor points to the indicators I1, I2, I3 (the factor causes the variables); in PCA, the indicators I1, I2, I3 point to the component (the component is an aggregate of the variables)]


PCA

Extraction is the process of forming PCs as linear combinations of the measured variables, as we have done with our other techniques
x PC1 = b11X1 + b21X2 + … + bk1Xk
x PC2 = b12X1 + b22X2 + … + bk2Xk
x PCf = b1fX1 + b2fX2 + … + bkfXk

Common factor model

Each measure X has two contributing sources of variation: the common factor ξ and the specific or unique factor δ:
x X1 = λ1ξ + δ1
x X2 = λ2ξ + δ2
x Xf = λfξ + δf

FA vs. PCA
PCA
PCA is mathematically precise in orthogonalizing
dimensions
PCA redistributes all variance into orthogonal components
PCA uses all variable variance and treats it as true variance

FA
FA distributes common variance into orthogonal factors
FA is conceptually realistic in identifying common factors
FA recognizes measurement error and true factor variance

FA vs. PCA
In some sense, PCA and FA are not so different conceptually from what we have been doing since multiple regression
Creating linear combinations
PCA especially falls more along the lines of what we've already been doing
What is different from previous methods is that there is no IV/DV distinction
Just a single set of variables

FA vs. PCA Summary


PCA's goal is to analyze variance and reduce the observed variables
PCA reproduces the R matrix perfectly
In PCA the goal is to extract as much variance as possible with the fewest components
PCA gives a unique solution

FA analyzes covariance (communality)
FA is a close approximation to the R matrix
In FA the goal is to explain as much of the covariance as possible with a minimum number of factors that are tied specifically to assumed constructs
FA can give multiple solutions depending on the method and the estimates of communality

Questions Regarding PCA


Which components account for the most
variance?
How well does the component structure fit a
given theory?
What would each subjects score be if they could
be measured directly on the components?
What is the percentage of variance in the data
accounted for by the components?

Assumptions/Issues
Assumes reliable variables/correlations
Very much affected by missing data, outlying cases and
truncated data
Data screening methods (e.g. transformations, etc.) may
improve poor factor analytic results

Normality
Univariate: normally distributed variables make the solution stronger, but normality is not necessary if we are using the analysis in a purely descriptive manner
Multivariate normality is assumed when statistically assessing the number of factors

Assumptions/Issues
No outliers
Influence on correlations would bias results

Variables as outliers
Some variables don't work
Explain very little variance
Relate poorly to the primary components
Low squared multiple correlation when treated as the DV with the other items as predictors
Low loadings

Assumptions/Issues
Factorable R matrix
Need inter-item/variable correlations > .30, or PCA/FA isn't going to do much for you
Large inter-item correlations do not guarantee a solution either
x While two variables may be highly correlated, they may not be correlated with the others
Kaiser's measure of sampling adequacy, based on the matrix of partial correlations adjusted for the other variables, can help assess this (a sketch follows below)
x Kaiser's measure is the ratio of the sum of squared correlations to the sum of squared correlations plus the sum of squared partial correlations
x It approaches 1 if the partials are small; typically we desire values of about .6+
Multicollinearity/Singularity
In traditional PCA it is not a problem; no matrix inversion is necessary
x As such it is one solution for dealing with collinearity in regression
Investigate tolerances, det(R)
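Below is a minimal R sketch of these factorability checks, assuming a hypothetical data frame X holding the variables to be analyzed; KMO() comes from the psych package.

# Factorability checks for a hypothetical data frame X of numeric variables
library(psych)

R <- cor(X, use = "pairwise.complete.obs")  # inter-item correlation matrix

range(abs(R[lower.tri(R)]))  # are correlations generally above .30?
det(R)                       # a near-zero determinant flags multicollinearity/singularity
KMO(R)                       # Kaiser's measure of sampling adequacy (want roughly .6+)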

Assumptions/Issues
Sample Size and Missing Data
True missing data are handled in the usual ways
Factor analysis via maximum likelihood needs large samples, which is one of its few drawbacks

The more reliable the correlations are, the smaller the number of
subjects needed
Need enough subjects for stable estimates

How many?
Depends on the nature of the data and the number of parameters to be
estimated
x For example, a simple setting with few variables and clean data might not need as many cases
x Several hundred data points for a more complex solution, with messy data and lower correlations among the variables, might not provide a meaningful result (PCA) or even converge on a solution (FA)

Other issues
No readily defined criteria by which to judge the outcome
Before we had R² for example
Choice of rotation is dependent entirely on the researcher's estimation of interpretability
Often used when other outcomes/analyses are not so hot, just to have something to talk about

Extraction Methods
PCA
Extracts maximum variance with each component
First component is a linear combination of variables
that maximizes component score variance for the cases
The second (etc.) extracts the max. variance from the
residual matrix left over after extracting the first
component (therefore orthogonal to the first)
If all components retained, all variance explained

PCA
Components are linear combinations of variables
These combinations are based on weights (eigenvectors) developed by the analysis
As we will see later, PCA is not much different from canonical correlation in terms of generating canonical variates from linear combinations of variables
Although in PCA there are no 'sides' of the equation, and you're not necessarily correlating the factors, components, variates, etc.
The loading for each item/variable is its correlation with the component (i.e., the underlying shared variance)
However, unlike many of the analyses you have been exposed to, there is no statistical criterion to compare the linear combination to
In MANOVA we create linear combinations that maximally differentiate groups
In canonical correlation one linear combination is used to maximally correlate with another
PCA is a form of unsupervised learning

PCA
With multivariate research we come to eigenvalues and eigenvectors
Eigenvalues
Conceptually, an eigenvalue can be considered to measure the strength (relative length) of an axis in N-dimensional space
Derived via eigenanalysis of the square, symmetric matrix
x The covariance or correlation matrix
Eigenvector
Each eigenvalue has an associated eigenvector. While the eigenvalue gives the length of an axis, the eigenvector determines its orientation in space. The values in an eigenvector are not unique, because any set of coordinates that described the same orientation would be acceptable.
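To make this concrete, a small sketch using base R's eigen() on a correlation matrix (X again a hypothetical data matrix):

# Eigenanalysis of the square, symmetric correlation matrix
R <- cor(X)
e <- eigen(R)

e$values        # eigenvalues: the variance (relative length) of each new axis
e$vectors       # eigenvectors: the orientation of each axis (columns, unit length)
sum(e$values)   # equals the number of variables when R is analyzed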

Data
Example data of women's height and weight (raw and standardized)

height  weight  Zheight  Zweight
57      93      -1.774   -1.965
58      110     -1.471   -0.873
60      99      -0.864   -1.580
59      111     -1.168   -0.809
61      115     -0.561   -0.552
60      122     -0.864   -0.103
62      110     -0.258   -0.873
61      116     -0.561   -0.488
62      122     -0.258   -0.103
63      128      0.045    0.283
62      134     -0.258    0.668
64      117      0.349   -0.424
63      123      0.045   -0.039
65      129      0.652    0.347
64      135      0.349    0.732
66      128      0.955    0.283
67      135      1.259    0.732
66      148      0.955    1.567
68      142      1.562    1.182
69      155      1.865    2.017

Data transformation
Consider two variables, height and weight
X would be our data matrix, w our eigenvector (coefficients)
Multiplying our original data by these weights results in a column vector of values
z1 = Xw
Multiplying a matrix by a vector results in a linear combination
The variance of this linear combination is the eigenvalue

Data transformation
Consider a woman 5 feet tall (60 inches) and 122 pounds
She is -.86 sd from the mean height and -.10 sd from the mean weight for these data

a'b = (a1 a2)(b1 b2)' = a1b1 + a2b2

The first eigenvector associated with the normalized data is [.707, .707], so the resulting value for that data point is -.68
So with the top graph we have taken the original data point and projected it onto a new axis, -.68 units from the origin
Now if we do this for all data points we will have projected them onto a new axis/component/dimension/factor/linear combination
The length of the new axis is the eigenvalue
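A hedged R sketch of this projection, using the height/weight data above; the sign of an eigenvector is arbitrary, so the projection may come out as +.68 rather than -.68.

height <- c(57, 58, 60, 59, 61, 60, 62, 61, 62, 63, 62, 64, 63, 65, 64, 66, 67, 66, 68, 69)
weight <- c(93, 110, 99, 111, 115, 122, 110, 116, 122, 128, 134, 117, 123, 129, 135, 128, 135, 148, 142, 155)

Z  <- scale(cbind(height, weight))   # standardize both variables
w1 <- eigen(cor(Z))$vectors[, 1]     # first eigenvector, approximately (.707, .707)

z1 <- Z %*% w1   # project every case onto the new axis
z1[6]            # the 60-inch, 122-pound woman: about -.68 (up to sign)
var(z1)          # variance of the projections = the first eigenvalue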

Data transformation
Suppose we have more than one dimension/factor?
In our discussion of the techniques thus far, we have said that each component or dimension is independent of the previous one
What does independent mean?

r = 0

What does this mean geometrically in the multivariate sense?
It means that the next axis specified is perpendicular to the previous one
Note how r is represented even here
The cosine of the 90° angle formed by the two axes is 0
Had the lines been on top of each other (i.e., perfectly correlated), the angle formed by them would be zero, whose cosine is 1

r=1

Data transformation
The other eigenvector associated with the data is (-.707, .707)
Doing as we did before, we'd create that second axis, and then could plot the data points along these new axes
We now have two linear combinations, each of which is interpretable as the vector of projections of the original data points onto a directed line segment
Note how the basic shape of the original data has been perfectly maintained
The effect has been to rotate the configuration (45°) to a new orientation while preserving its essential size and shape
It is an orthogonal transformation
Note that we have been talking of specifying/rotating axes, but rotating the points themselves would give us the same result

Meaning of Principal Components


Component analyses are those that are based on the full correlation matrix
1.00s in the diagonal
Principal analyses are those for which each successive factor...
accounts for maximum available variance
is orthogonal (uncorrelated, independent) to all prior factors
gives a full solution (as many factors as variables), i.e. accounts for all the variance

Application of PC analysis
Components analysis is a kind of data reduction
start with an inter-related set of measured variables
identify a smaller set of composite variables that can be constructed from the measured variables and that carry as much of their information as possible

A full components solution...
has as many components as variables
accounts for 100% of the variables' variance
each variable has a final communality of 1.00

A truncated components solution...
has fewer components than variables
accounts for <100% of the variables' variance
each variable has a communality < 1.00

The steps of a PC analysis


Compute the correlation matrix
Extract a full components solution
Determine the number of components to keep
total variance accounted for
variable communalities
interpretability
replicability
Rotate the components and interpret (name) them
Compute component scores
Apply the components solution
theoretically -- understand the meaning of the data reduction
statistically -- use the component scores in other analyses

PC Extraction
Extraction is the process of forming PCs as linear combinations of the measured variables, as we have done with our other techniques
x PC1 = b11X1 + b21X2 + … + bk1Xk
x PC2 = b12X1 + b22X2 + … + bk2Xk
x PCf = b1fX1 + b2fX2 + … + bkfXk
The goal is to reproduce as much of the information in the measured variables with as few PCs as possible
Here's the thing to remember
We usually perform factor analyses to find out how many groups of related variables there are, however
The mathematical goal of extraction is to reproduce the variables' variance, efficiently

3 variable example
Consider 3 variables with the correlations displayed
In a 3D sense we might envision their relationship as shown, with the shadows being roughly what the scatterplots would look like for each bivariate relationship
[3D scatterplot of X1, X2, X3]

The first component identified

The variance of this component, its eigenvalue, is 2.063
In other words it accounts for about twice as much variance as any single variable
Note: with 3 variables, 2.063/3 = .688, i.e. 68.8% of the variance is accounted for by this first component

PCA
In principal components, we extract as many components as there are variables
As mentioned previously, each component by default is uncorrelated with the previous ones
If we save the component scores and were to look at their graph, it would resemble something like this

How do we interpret the components?


The component loadings can inform us as to their interpretation
They are the original variables' correlations with the component
In this case, all load nicely on the first component, which, since the others do not account for nearly as much variance, is probably the only one to interpret
Depending on the type of PCA, the rotation, etc., you may see different loadings, although often the general pattern will remain
With PCA it is as much the overall pattern that should be considered as the sign or absolute values of particular loadings
Which variables load onto which components in general?

Here is an example of magazine readership from the chapter handout
Underlined loadings are > .30
How might this be interpreted?

Applied example
Six items
Three sadness, three relationship quality
N = 300

PCA

Start with the Correlation Matrix

Communalities are Estimated


A measure of how much variance of the original variables is accounted for by the observed components/factors
Uniqueness is 1 - communality
With PCA retaining all components (as opposed to a truncated solution), communality will always equal 1
Why 1.0?
PCA analyzes all the variance for each variable

As we'll see with FA, the approach will be different
The initial value is the multiple R² for the association between an item and all the other items in the model
FA analyzes shared variance

What are we looking for?


Any factor whose eigenvalue is less than 1.0 is in most cases not going to be retained for interpretation
Unless it is very close or has a readily understood and interesting meaning

Loadings that are:
more than .5 are typically considered strong
between .3 and .5 are acceptable
less than .3 are typically considered weak

Matrix reproduction

All the information in the correlation matrix is maintained
Correlations can be reproduced exactly in PCA
x Sum of the cross-products of the loadings

Assessing the variance accounted for


Eigenvalue / N of items or variables
The eigenvalue is an index of the strength of the component, the amount of variance it accounts for
It is also the sum of the squared loadings for that component

Loadings

Eigenvalue of factor 1 = .609² + .614² + .593² + .728² + .767² + .764² = 2.80
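A quick R check of this arithmetic, using the loadings shown on the slide:

loadings_f1 <- c(.609, .614, .593, .728, .767, .764)
sum(loadings_f1^2)   # approximately 2.80, the eigenvalue of the first component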

Reproducing the correlation matrix (R)


Sum the products of the loadings for two variables across all factors
For RQ1 and RQ2:
(.61 * .61) + (.61 * .57) + (-.12 * -.41) + (-.45 * .33) + (.06 * .05) + (.20 * -.16) = .59
If we kept just the first two factors, the reproduced correlation = .72

Note that an index of the quality of a factor analysis (as opposed to PCA) is the extent to which the factor loadings can reproduce the correlation matrix. With PCA, the correlation matrix is reproduced exactly if all components are retained; when we don't retain them all, we can use a similar approach to assess fit.
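The same idea in R, assuming a hypothetical full loading matrix L (rows = items, columns = components) from the PCA:

L %*% t(L)                  # reproduces the correlation matrix exactly when all components are kept
sum(L[1, ] * L[2, ])        # correlation of items 1 and 2 (e.g., RQ1 and RQ2) from all components
sum(L[1, 1:2] * L[2, 1:2])  # approximate reproduction using only the first two components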

Original correlation

Variance Accounted For


For items
The sum of the squared loadings (i.e., weights) across the components is the amount of variance accounted for in each item
Item 1:
x .61² + .61² + (-.12)² + .45² + .06² + .20²
x = .37 + .37 + .014 + .20 + .004 + .04 ≈ 1.0
For the first two factors: .74

For components
How much variance is accounted for by the components that will be retained?

When is it appropriate to use PCA?


PCA is largely a descriptive procedure
In our examples, we are looking at variables with decent correlations. However, if the variables are largely uncorrelated, PCA won't do much for you
It may just provide components that are representative of each individual variable, i.e. nothing is gained
One may use Bartlett's sphericity test to determine whether such an approach is appropriate
It tests the null hypothesis that the R matrix is an identity matrix (ones on the diagonal, zeros off the diagonal)
When the determinant of R is small (recall from before that this implies strong correlation), the chi-square statistic will be large; we reject H0 and PCA would be appropriate for data reduction
One should note though that it is a powerful test, and will usually result in rejection with typical sample sizes
One may instead refer to an estimate of practical effect rather than a statistical test

Are the correlations worthwhile?

χ² = -[(n - 1) - (2p + 5)/6] ln|R|
df = (p² - p)/2
p = number of variables
n = number of observations
ln|R| = natural log of the determinant of R
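A sketch of this test in R, either by hand from the formula above or via psych's cortest.bartlett(); X is a hypothetical data matrix.

R <- cor(X)
n <- nrow(X)
p <- ncol(X)

chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
df    <- (p^2 - p) / 2
pchisq(chisq, df, lower.tail = FALSE)  # small p-value: reject H0 that R is an identity matrix

psych::cortest.bartlett(R, n = n)      # packaged equivalent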

How should the data be scaled?


In most of our examples we have been using the R
matrix instead of the var-covar matrix
As PCA seeks to maximize variance, it can be sensitive to
scale differences across variables
Variables with a larger range of scores would thus have
more of an impact on the linear combination created
As such, the R matrix will typically be used, except
perhaps in cases where the items are on the same scale
(e.g. Likert)
The values involved will change (e.g. eigenvalues),
though the general interpretation may not

How many components should be retained?


There are many ways to determine this
"Solving the number of factors problem is easy, I do it every day before breakfast. But knowing the right solution is harder" (Kaiser)

Kaiser's Rule
What we've already suggested, i.e. eigenvalues over 1
The idea is that any component retained should account for at least as much variance as a single variable
Another perspective on this is to retain as many components as will account for X% of the variance in the original variables

Practical approach

Scree Plot
Look for the elbow
Look for the point after which the remaining eigenvalues decrease in a linear fashion, and retain only those above the elbow
Not really a good primary approach, though it may be consistent with others
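A scree plot is just the eigenvalues plotted in decreasing order; a base-R sketch (X again hypothetical):

ev <- eigen(cor(X))$values
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)   # Kaiser's rule reference line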

Chi-square

The null hypothesis is that X number of components is sufficient
Want a nonsignificant result

Horn's Procedure
This is a different approach, which suggests creating a set of random data of the same size (N cases and p variables)
The idea is that in maximizing the variance accounted for, PCA has a good chance of capitalizing on chance
Even with random data, the first eigenvalue will be > 1
As such, retain components with eigenvalues greater than those produced by the random data (see the sketch below)
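psych's fa.parallel() implements a version of Horn's procedure; a by-hand sketch of the same idea is shown as well (X hypothetical; comparing to the 95th percentile of the random eigenvalues, rather than their mean, is one common variant).

psych::fa.parallel(X, fa = "pc")   # packaged parallel analysis for components

# By-hand sketch of the same idea
set.seed(123)
n <- nrow(X); p <- ncol(X)
rand_ev <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)))$values)  # random-data eigenvalues
obs_ev  <- eigen(cor(X))$values
obs_ev > apply(rand_ev, 1, quantile, 0.95)  # TRUE for components that beat the random benchmark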

Rotation
Sometimes our loadings will be a little difficult to interpret initially
In such a case we can rotate the solution so that the loadings perhaps make more sense
This is typically done in factor analysis but is possible here too
An orthogonal rotation is just a shift to a new set of coordinate axes in the same space spanned by the principal components

Rotation
You can think of it as shifting the axes, or rotating the 'egg' in our previous graphic
The gist is that the relations among the items are maintained, while maximizing their more natural loadings and minimizing off-loadings
Note that since PCA initially creates independent components, orthogonal rotations that maintain this independence are typically used
Loadings will be either large or small, with little in between
Varimax is the common rotation utilized
It maximizes the variance of the squared loadings within each component
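A sketch of an orthogonal (varimax) rotation in R; principal() is from psych, and base R's varimax() can also be applied directly to a retained loading matrix L (both X and L are illustrative names).

pc2 <- psych::principal(cor(X), nfactors = 2, rotate = "varimax")  # retain and rotate two components
pc2$loadings

varimax(L)   # or rotate an existing (variables x retained components) loading matrix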

Other issues: How do we assess validity?


Usual suspects
Cross-validation
Holdout sample, as we have discussed before
About a 2/3, 1/3 split
Using the eigenvectors from the original components, we can create new components with the new data and see how much variance each accounts for
Hope it's similar to the original solution

Jackknife
With smaller samples, conduct the PCA multiple times, each with a specific case held out
Using the eigenvectors, calculate the component score for the case held out
Compare the eigenvalues for the components involved

Bootstrap
In the absence of a holdout sample, we can create bootstrapped samples to perform the same function (a sketch follows)
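A minimal bootstrap sketch of that last idea, resampling cases and examining how stable the eigenvalues are across resamples (X hypothetical):

set.seed(42)
boot_ev <- replicate(500, {
  Xb <- X[sample(nrow(X), replace = TRUE), ]  # resample cases with replacement
  eigen(cor(Xb))$values
})
apply(boot_ev, 1, quantile, c(.025, .975))    # interval for each eigenvalue across bootstrap samples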

Other issues:
Factoring items vs. factoring scales
Items are often factored as part of the process of scale development
Check if the items go together like the scale's author thinks
Scales (composites of items) are factored to examine the construct validity of new scales
test theory about what constructs are interrelated
Remember, the reason we have scales is that individual items are typically unreliable and have limited validity

Other issues:
Factoring items vs. factoring scales
The limited reliability and validity of items means that they will be measured with less precision, and so their intercorrelations for any one sample will be fraught with error
Since factoring starts with R, factoring of items is likely to yield spurious solutions -- replication of item-level factoring is very important!
Is the issue really items vs. scales?
No -- it is really the reliability and validity of the things being factored, and scales have these properties more than items do

Other issues:
When is it appropriate to use PCA?
Another reason to use PCA, which obviously isn't a great one, is that the maximum likelihood estimation involved in an exploratory factor analysis does not converge
PCA will always give a result (it does not require matrix inversion) and so can often be used in such a situation
We'll talk more on this later, but in data reduction situations EFA is typically to be preferred for social scientists and others who use imprecise measures

Other issues:
Selecting Variables for Analysis
Sometimes a researcher has access to a data set that
someone else has collected -- an opportunistic data set
While this can be a real money/time saver, be sure to
recognize the possible limitations
Be sure the sample represents a population you want to
talk about
Carefully consider variables that aren't included and the possible effects their absence has on the resulting factors
this is especially true if the data set was chosen to be efficient -- variables chosen to cover several domains
You should plan to replicate any results obtained from opportunistic data

Other issues:
Selecting the Sample for Analysis
How many?
Keep in mind that R, and so the factor solution, is the same no matter how many cases are used -- so the point is the representativeness and stability of the correlations
Advice about the subject/variable ratio varies pretty dramatically
5-10 cases per variable
300 cases minimum (maybe + # of items)
Consider that, as for other statistics, your standard error for a correlation decreases with increasing sample size

A note about SPSS


SPSS does provide a means for principal
components analysis
However, its presentation (much like many
textbooks for that matter) blurs the distinction
between PCA and FA, such that they are easily
confused
Although they are both data dimension
reduction techniques, they do go about the
process differently, have different implications
regarding the results and can even come to
different conclusions

A note about SPSS


In SPSS, the menu is factor analysis (even though
principal components is the default technique setting)
Unlike in other programs, PCA isn't even a separate procedure (it's all in the FACTOR syntax)
In order to perform PCA, make sure you have principal components selected as your extraction method, analyze the correlation matrix, and specify that the number of factors to be extracted equals the number of variables
Even then, your loadings will be different from those of other programs, which are scaled such that the sum of their squared values = 1
In general, be cautious when using SPSS

PCA in R

Package name    Function name
base            princomp
psych           principal, VSS
pcaMethods      pca (with Q2 for cross-validation). As the name implies, this package is all about PCA, from a modern approach. It will automatically estimate missing values (via traditional, robust, or Bayesian methods) and is useful just for that for any analysis.
FactoMineR      PCA (an R-commander plugin is available)
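Hedged sketches of what those calls might look like for a data frame X; the pcaMethods arguments in particular are from memory and should be checked against that package's documentation.

fit1 <- princomp(X, cor = TRUE)   # base R, PCA on the correlation matrix
summary(fit1)
loadings(fit1)

fit2 <- psych::principal(X, nfactors = ncol(X), rotate = "none")   # full, unrotated solution

# pcaMethods (Bioconductor), e.g. probabilistic PCA with missing-value handling:
# fit3 <- pcaMethods::pca(X, method = "ppca", nPcs = 2)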
