• Objectives:
– Understand the principles of principal components analysis (PCA)
– Recognize conditions under which PCA may be useful
– Use the SAS procedure PRINCOMP to
  • perform a principal components analysis
  • interpret PRINCOMP output
Obs        X            Y
 1     -1.264911   -1.788854
 2     -0.632456   -0.894427
 3      0.000000    0.000000
 4      0.632456    0.894427
 5      1.264911    1.788854
Mean    0.0000      0.0000
Var     1           2

[Figure: scatterplot of Y against X for the five points; all points fall on a straight line through the origin.]
$$r_{X,Y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} = \frac{5.6569}{5.6569} = 1$$

$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{5.6569}{4} = 1.4142$$
Correlation matrix:
      X   Y
X     1   1
Y     1   1

Covariance matrix:
      X       Y
X     1       1.414
Y     1.414   2
Xuhua Xia Slide 5
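The correlation and covariance values above can be checked numerically. The sketch below uses numpy (an illustrative substitute for the slides' SAS code) to recompute the sample covariance, the Pearson correlation, and the two variances for the five (X, Y) points:

```python
# Numeric check of the correlation/covariance matrices above (numpy sketch,
# not part of the original slides).
import numpy as np

x = np.array([-1.264911064, -0.632455532, 0.0, 0.632455532, 1.264911064])
y = np.array([-1.788854382, -0.894427191, 0.0, 0.894427191, 1.788854382])
n = len(x)

dx, dy = x - x.mean(), y - y.mean()
cov_xy = np.sum(dx * dy) / (n - 1)                               # sample covariance
r_xy = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))  # Pearson r

print(round(cov_xy, 4))   # 1.4142 -- the 1.414 entry of the covariance matrix
print(round(r_xy, 4))     # 1.0    -- perfect correlation
print(round(x.var(ddof=1), 4), round(y.var(ddof=1), 4))  # variances 1.0 and 2.0
```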
General Patterns
• The total variance is 3 (= 1 + 2).
• The two variables, X and Y, are perfectly correlated, with all points falling on the regression line.
• The spatial relationship among the 5 points can therefore be represented by a single dimension.
• PCA is a dimension-reduction technique. What would happen if we applied PCA to these data?
[Figure: the same five (X, Y) points plotted again; X runs from -1.5 to 1.5 and Y from -2 to 2.]
SAS Program
data pca;
  input x y;
  cards;
-1.264911064 -1.788854382
-0.632455532 -0.894427191
0 0
0.632455532 0.894427191
1.264911064 1.788854382
;
proc princomp cov out=pcscore;
proc print;
  var prin1 prin2;
proc princomp data=pca out=pcscore;
proc print;
  var prin1 prin2;
run;

The COV option requests that the PCA be carried out on the covariance matrix rather than the correlation matrix. Without the COV option, the PCA is carried out on the correlation matrix.
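A rough Python analogue of the two PROC PRINCOMP runs can be written with numpy eigendecompositions (this is an illustrative sketch, not SAS itself): one decomposition of the covariance matrix and one of the correlation matrix.

```python
# numpy sketch of the two PROC PRINCOMP runs (covariance vs. correlation).
import numpy as np

data = np.array([
    [-1.264911064, -1.788854382],
    [-0.632455532, -0.894427191],
    [ 0.0,          0.0        ],
    [ 0.632455532,  0.894427191],
    [ 1.264911064,  1.788854382],
])

def pca_eig(mat):
    """Eigenvalues/eigenvectors of a symmetric matrix, largest eigenvalue first."""
    vals, vecs = np.linalg.eigh(mat)   # eigh returns ascending order
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

# With the COV option: decompose the covariance matrix
cov_vals, cov_vecs = pca_eig(np.cov(data, rowvar=False))
print(np.round(cov_vals, 5))                # ~[3, 0]
print(np.round(np.abs(cov_vecs[:, 0]), 6))  # ~[0.57735, 0.816497] (up to sign)

# Without COV: decompose the correlation matrix (the SAS default)
corr_vals, corr_vecs = pca_eig(np.corrcoef(data, rowvar=False))
print(np.round(corr_vals, 5))               # ~[2, 0]
print(np.round(np.abs(corr_vecs[:, 0]), 6)) # ~[0.707107, 0.707107]
```

Note that the eigenvector signs are arbitrary (numpy may flip them relative to the SAS output), which is why the check looks at absolute values.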
Eigenvectors (from the covariance matrix):
        PRIN1      PRIN2
X     0.577350   0.816497
Y     0.816497  -0.577350

Eigenvectors (from the correlation matrix):
        PRIN1      PRIN2
X     0.707107   0.707107
Y     0.707107  -0.707107

Principal component scores (correlation matrix):
OBS     PRIN1    PRIN2
 1    -1.78885     0
 2    -0.89443     0
 3     0.00000     0
 4     0.89443     0
 5     1.78885     0

The eigenvalues give the variance accounted for by each principal component. What's the variance in PC1?
Steps in a PCA
• Have at least two variables
• Generate a correlation or variance-covariance matrix
• Obtain eigenvalues and eigenvectors (This is called
an eigenvalue problem, and will be illustrated with a
simple numerical example)
• Generate principal component (PC) scores
• Plot the PC scores in the reduced-dimensional space
• All these steps can be automated in SAS.
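The steps listed above can also be sketched end-to-end in numpy (an illustrative assumption; the slides automate them with SAS). Applied to the five (x, y) points, the sketch also answers the earlier question about the variance in PC1:

```python
# The PCA steps above, sketched end-to-end with numpy instead of SAS.
import numpy as np

# Step 1: at least two variables -- the slides' five (x, y) points
data = np.array([
    [-1.264911064, -1.788854382],
    [-0.632455532, -0.894427191],
    [ 0.0,          0.0        ],
    [ 0.632455532,  0.894427191],
    [ 1.264911064,  1.788854382],
])

# Step 2: variance-covariance matrix
cov = np.cov(data, rowvar=False)

# Step 3: eigenvalues and eigenvectors, sorted largest first
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 4: PC scores = centered data projected onto the eigenvectors
scores = (data - data.mean(axis=0)) @ vecs

# Step 5 would plot scores[:, 0] alone; the variance of PC1 equals its
# eigenvalue (3), which here is the entire total variance.
print(round(scores[:, 0].var(ddof=1), 5))   # 3.0
```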
[Figure: bar charts of species abundances (Sp1, Sp2, Sp3) across samples.]
The covariance matrix is

$$A = \begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix}$$

Setting $\det(A - \lambda I) = 0$ gives the characteristic equation

$$\lambda^2 - 3\lambda = 0$$

so $\lambda_1 = 0$ and $\lambda_2 = 3$.

For $\lambda = 0$,

$$Ax = \begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

which is equivalent to

$$x_1 + \sqrt{2}\,x_2 = 0, \qquad \sqrt{2}\,x_1 + 2 x_2 = 0$$

so $x_2 = -x_1/\sqrt{2}$.

For $\lambda = 3$,

$$Ax = \begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 3 \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

which is equivalent to

$$x_1 + \sqrt{2}\,x_2 = 3 x_1, \qquad \sqrt{2}\,x_1 + 2 x_2 = 3 x_2$$

so $x_2 = \sqrt{2}\,x_1$.
Get the Eigenvectors
• We want to find eigenvectors of unit length, i.e., $x_1^2 + x_2^2 = 1$.
• From the previous slide, for $\lambda = 0$ we have $x_2 = -x_1/\sqrt{2}$; solving

$$x_1^2 + \frac{x_1^2}{2} = 1$$

gives $x_1 = 0.8165$, $x_2 = -0.5774$.

• For $\lambda = 3$ we have $x_2 = \sqrt{2}\,x_1$; solving

$$x_1^2 + 2 x_1^2 = 1$$

gives $x_1 = 0.5774$, $x_2 = 0.8165$.

The first eigenvector is the one associated with the largest eigenvalue ($\lambda = 3$).
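The unit eigenvectors derived above can be verified numerically; the numpy sketch below (added for illustration) normalizes each direction and confirms that $Av = \lambda v$:

```python
# Numeric check of the unit eigenvectors of the covariance matrix.
import numpy as np

A = np.array([[1.0,         np.sqrt(2)],
              [np.sqrt(2),  2.0       ]])

# For lambda = 3: x2 = sqrt(2) * x1, scaled to unit length
v3 = np.array([1.0, np.sqrt(2)])
v3 = v3 / np.linalg.norm(v3)
print(np.round(v3, 4))               # [0.5774 0.8165]
print(np.allclose(A @ v3, 3 * v3))   # True

# For lambda = 0: x2 = -x1 / sqrt(2), scaled to unit length
v0 = np.array([1.0, -1.0 / np.sqrt(2)])
v0 = v0 / np.linalg.norm(v0)
print(np.round(v0, 4))               # [ 0.8165 -0.5774]
print(np.allclose(A @ v0, 0 * v0))   # True
```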
Get the PC Scores
Original data (x and y):
-1.264911064  -1.788854382
-0.632455532  -0.894427191
 0.000000000   0.000000000
 0.632455532   0.894427191
 1.264911064   1.788854382

Eigenvectors (PC1, PC2):
0.577350   0.816497
0.816497  -0.577350

PC scores (first PC score, second PC score):
-2.19089   0
-1.09545   0
 0.00000   0
 1.09545   0
 2.19089   0

The original data in a two-dimensional space are reduced to one dimension.
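The multiplication that produces the PC scores (original data times the eigenvector matrix) can be checked with a numpy sketch:

```python
# PC scores = data matrix times eigenvector matrix (numpy check).
import numpy as np

data = np.array([
    [-1.264911064, -1.788854382],
    [-0.632455532, -0.894427191],
    [ 0.0,          0.0        ],
    [ 0.632455532,  0.894427191],
    [ 1.264911064,  1.788854382],
])
E = np.array([[0.577350,  0.816497],
              [0.816497, -0.577350]])   # columns: PC1 and PC2 eigenvectors

scores = data @ E
print(np.round(scores, 4))
# PC1 column: -2.1909, -1.0954, 0, 1.0954, 2.1909; PC2 column: ~0
```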
What Are Principal Components?
• Principal components are a new set of variables,
which are linear combinations of the observed ones,
with these properties:
– Because of the decreasing-variance property, much of the variance (the information in the original set of p variables) tends to be concentrated in the first few PCs. This implies that we can drop the last few PCs without losing much information; PCA is therefore considered a dimension-reduction technique.
– Because PCs are orthogonal, they can be used instead of
the original variables in situations where having
orthogonal variables is desirable (e.g., regression).
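The orthogonality property can be seen directly in the eigenvector matrix from the worked example: the two PC directions have zero dot product. A one-line numpy check (added for illustration):

```python
# Orthogonality of the eigenvectors (and hence of the PC axes).
import numpy as np

E = np.array([[0.577350,  0.816497],
              [0.816497, -0.577350]])   # columns: PC1 and PC2 directions

dot = float(E[:, 0] @ E[:, 1])
print(round(dot, 6))   # ~0: the two axes are perpendicular
```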
Obs    Math   English
 1      60      55
 2      70      65
 3      80      75
 4      90      85
 5     100      95
Mean   80.0    75.0
Var    250     250

[Figure: scatterplot of English scores against Math scores.]
[Figure: scatterplot of PC2 against PC1 (PC1 from -5 to 7) with U.S. states labeled by abbreviation; Mississippi, Alabama, Louisiana, and South Carolina cluster together.]