
Principal Components Analysis

• Objectives:
  – Understand the principles of principal components analysis (PCA)
  – Recognize conditions under which PCA may be useful
  – Use the SAS procedure PRINCOMP to
    • perform a principal components analysis
    • interpret PRINCOMP output

Xuhua Xia Slide 1


Typical Form of Data
A data set in an 8×3 matrix. The rows could be species and the
columns sampling sites.

X =
100  97  99
 96  90  90
 80  75  60
 75  85  95
 62  40  28
 77  80  78
 92  91  80
 75  85 100

A matrix is often referred to as an n×p matrix (n for the number of
rows and p for the number of columns). Our matrix has 8 rows and 3
columns, and is an 8×3 matrix. A variance-covariance matrix has
n = p, and is called an n-dimensional square matrix.
Xuhua Xia Slide 2
What are Principal Components?
Y = b1X1 + b2X2 + … + bpXp

• Principal components are linear combinations of the observed
variables. The coefficients of these principal components are
chosen to meet three criteria.
• What are the three criteria?

Xuhua Xia Slide 3


What are Principal Components?
• The three criteria:
– There are exactly p principal components
(PCs), each being a linear combination of
the observed variables;
– The PCs are mutually orthogonal (i.e.,
perpendicular and uncorrelated);
– The components are extracted in order of
decreasing variance.

Xuhua Xia Slide 4
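In symbols (a restatement of the three criteria, not from the original slides): each PC is a unit-length linear combination of the p observed variables, successive PCs are uncorrelated, and their variances are non-increasing:

$$Y_k = \sum_{j=1}^{p} b_{kj} X_j, \qquad \sum_{j=1}^{p} b_{kj}^2 = 1, \qquad \mathrm{Cov}(Y_k, Y_l) = 0 \;\; (k \neq l), \qquad \mathrm{Var}(Y_1) \geq \mathrm{Var}(Y_2) \geq \cdots \geq \mathrm{Var}(Y_p)$$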


A Simple Data Set
Obs    X             Y
1      -1.264911064  -1.788854
2      -0.632455532  -0.894427
3       0.000000000   0.000000
4       0.632455532   0.894427
5       1.264911064   1.788854
Mean    0.0000        0.0000
Var     1             2

[Figure: scatterplot of Y against X (axes from about -2 to 2); the five points fall on a straight line.]

$$r_{X,Y} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \sum (Y-\bar{Y})^2}} = \frac{5.6569}{\sqrt{4 \times 8}} = 1$$

$$\mathrm{Cov}(x,y) = \frac{\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}{n-1} = \frac{5.6569}{4} = 1.4142$$

Correlation matrix          Covariance matrix
       X   Y                       X      Y
X      1   1                X      1      1.414
Y      1   1                Y      1.414  2
Xuhua Xia Slide 5
General Patterns
• The total variance is 3 (= 1 + 2)
• The two variables, X and Y, are perfectly correlated, with all
points falling on the regression line.
• The spatial relationship among the 5 points can
therefore be represented by a single dimension.
• PCA is a dimension-reduction technique. What
would happen if we apply PCA to the data?

Xuhua Xia Slide 6


Graphic PCA

[Figure: the five data points plotted in the X–Y plane (X from -1.5 to 1.5, Y from -2 to 2), illustrating PCA graphically.]
Xuhua Xia Slide 7
SAS Program
data pca;
  input x y;
  cards;
-1.264911064 -1.788854382
-0.632455532 -0.894427191
0 0
0.632455532 0.894427191
1.264911064 1.788854382
;
/* The COV option requests that the PCA be carried out on the
   covariance matrix rather than the correlation matrix. */
proc princomp cov out=pcscore;
proc print;
  var prin1 prin2;
/* Without the COV option, the PCA is carried out on the
   correlation matrix. */
proc princomp data=pca out=pcscore;
proc print;
  var prin1 prin2;
run;

Xuhua Xia Slide 8


A positive definite matrix
• When you run the SAS program, the log file will warn that "The
Correlation Matrix is not positive definite." What does that mean?
• A symmetric matrix M (such as a correlation matrix or a
covariance matrix) is positive definite if z’Mz > 0 for all non-
zero vectors z with real entries, where z’ is the transpose of z.
• Given our correlation matrix with all entries being 1, it is easy
to find a z that leads to z'Mz = 0, so the matrix is not positive
definite:

1 1  z1  Replace the correlation matrix


 z1 z2     z   0 with the covariance matrix and
1 1  2 
Solution : z1   z2 solve for z.
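As a quick check, here is a minimal PROC IML sketch (an addition, not part of the original slides) verifying that z = (1, -1) gives z'Mz = 0 for this correlation matrix:

proc iml;
  M = {1 1, 1 1};   /* correlation matrix with all entries equal to 1 */
  z = {1, -1};      /* any vector with z1 = -z2 */
  q = z` * M * z;   /* the quadratic form z'Mz */
  print q;          /* prints 0, so M is not positive definite */
quit;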

Xuhua Xia Slide 9


SAS Output
Eigenvalues of the Covariance Matrix

         Eigenvalue   Difference   Proportion   Cumulative
PRIN1    3.00000      3.00000      1.00000      1.00000
PRIN2    0.00000      .            0.00000      1.00000

(Each eigenvalue is the variance accounted for by the corresponding principal component.)

Eigenvectors

      PRIN1      PRIN2
X     0.577350   0.816497
Y     0.816497   -.577350

PC1 = 0.577350*X + 0.816497*Y

Principal component scores:

OBS    PRIN1      PRIN2
1      -2.19089   0
2      -1.09545   0
3       0.00000   0
4       1.09545   0
5       2.19089   0

What's the variance in PC1? How are the scores computed?
Xuhua Xia Slide 10
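To answer the question above (a check added here, not shown on the slide): the variance of the PC1 scores equals the first eigenvalue. The scores have mean 0, so

$$\mathrm{Var}(\mathrm{PC1}) = \frac{\sum_{i=1}^{5} \mathrm{PC1}_i^2}{n-1} = \frac{2(2.19089^2) + 2(1.09545^2)}{4} = \frac{12}{4} = 3$$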
SAS Output
[Figure: the principal component scores plotted in the PC1–PC2 plane (PC1 from -3 to 3); all five points lie on the PC1 axis, since PRIN2 = 0 for every observation.]

OBS PRIN1 PRIN2


1 -2.19089 0
2 -1.09545 0
3 0.00000 0
4 1.09545 0
5 2.19089 0

Xuhua Xia Slide 11


SAS Output
Eigenvalues of the Correlation Matrix

         Eigenvalue   Difference   Proportion   Cumulative
PRIN1    2.00000      2.00000      1.00000      1.00000
PRIN2    0.00000      .            0.00000      1.00000

(Again, each eigenvalue is the variance accounted for by the corresponding principal component.)

Eigenvectors

      PRIN1      PRIN2
X     0.707107   0.707107
Y     0.707107   -.707107

Principal component scores:

OBS    PRIN1      PRIN2
1      -1.78885   0
2      -0.89443   0
3       0.00000   0
4       0.89443   0
5       1.78885   0

What's the variance in PC1?
Xuhua Xia Slide 12
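As before (a check added here, not on the slide), the variance of the PC1 scores equals the first eigenvalue of the correlation matrix:

$$\mathrm{Var}(\mathrm{PC1}) = \frac{2(1.78885^2) + 2(0.89443^2)}{4} = \frac{8}{4} = 2$$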
Steps in a PCA
• Have at least two variables
• Generate a correlation or variance-covariance matrix
• Obtain eigenvalues and eigenvectors (This is called
an eigenvalue problem, and will be illustrated with a
simple numerical example)
• Generate principal component (PC) scores
• Plot the PC scores in the space with reduced
dimensions
• All these can be automated by using SAS.

Xuhua Xia Slide 13
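The steps listed above can also be worked through by hand in SAS/IML. The sketch below (an addition, not from the slides) mirrors what PROC PRINCOMP with the COV option does for the five-point data set of Slide 5; note that eigenvectors are determined only up to sign, so PROC PRINCOMP may flip their direction:

proc iml;
  /* the five-point data set from Slide 5 */
  X = {-1.264911064 -1.788854382,
       -0.632455532 -0.894427191,
        0             0          ,
        0.632455532   0.894427191,
        1.264911064   1.788854382};
  m  = X[:,];                        /* column means */
  Xc = X - repeat(m, nrow(X), 1);    /* centered data */
  S  = (Xc` * Xc) / (nrow(X) - 1);   /* variance-covariance matrix */
  call eigen(lambda, V, S);          /* eigenvalues and eigenvectors */
  scores = Xc * V;                   /* principal component scores */
  print lambda, V, scores;
quit;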


Covariance or Correlation Matrix?

[Figure: abundances of two species, Sp1 and Sp2, plotted across samples (y-axis: Abundance, 0–40).]

Xuhua Xia Slide 14


Covariance or Correlation Matrix?
[Figure: abundances of two species, Sp2 and Sp3, plotted across samples (y-axis: Abundance, 0–35).]

Xuhua Xia Slide 15


Covariance or Correlation Matrix?
[Figure: abundances of three species, Sp1, Sp2, and Sp3, plotted across samples (y-axis: Abundance, 0–35).]

Xuhua Xia Slide 16
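(General guidance, not from the slides: a PCA on the covariance matrix lets the species with the largest variance in abundance dominate the first PC, whereas a PCA on the correlation matrix standardizes each species to unit variance and weights them equally. The covariance matrix is the natural choice when the variables share the same scale and their variance differences are meaningful.)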


The Eigenvalue Problem
The covariance matrix:

$$A = \begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix}$$

The eigenvalues are the values of λ that satisfy the condition below. There are n eigenvalues for n variables, and the sum of the eigenvalues equals the sum of the variances in the covariance matrix.

$$\begin{vmatrix} 1-\lambda & \sqrt{2} \\ \sqrt{2} & 2-\lambda \end{vmatrix} = (1-\lambda)(2-\lambda) - 2 = \lambda^2 - 3\lambda = 0$$

The resulting eigenvalues:

$$\lambda_1 = 0, \qquad \lambda_2 = 3$$

Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem).
Xuhua Xia Slide 17
Get the Eigenvectors
• An eigenvector is a vector x that satisfies the following condition: Ax = λx
• In our case A is the variance-covariance matrix of order 2, and x
is a vector specified by x1 and x2.

For λ = 0:

$$A\mathbf{x} = \begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

which is equivalent to

$$x_1 + \sqrt{2}\,x_2 = 0, \qquad \sqrt{2}\,x_1 + 2x_2 = 0 \quad\Rightarrow\quad x_2 = -\frac{x_1}{\sqrt{2}}$$

For λ = 3:

$$A\mathbf{x} = \begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 3\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

which is equivalent to

$$x_1 + \sqrt{2}\,x_2 = 3x_1, \qquad \sqrt{2}\,x_1 + 2x_2 = 3x_2 \quad\Rightarrow\quad x_2 = \sqrt{2}\,x_1$$
Xuhua Xia Slide 18
Get the Eigenvectors
• We want to find an eigenvector of unit length, i.e., x1² + x2² = 1.
• We therefore have, from the previous slide:

For λ = 0:

$$x_2 = -\frac{x_1}{\sqrt{2}}, \qquad x_1^2 + \frac{x_1^2}{2} = 1 \quad\Rightarrow\quad x_1 = 0.8165,\ x_2 = -0.5774$$

For λ = 3:

$$x_2 = \sqrt{2}\,x_1, \qquad x_1^2 + 2x_1^2 = 1 \quad\Rightarrow\quad x_1 = 0.5774,\ x_2 = 0.8165$$

The first eigenvector is the one associated with the largest eigenvalue (here λ = 3).
Xuhua Xia Slide 19
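As a check (added here, not on the slide), the unit eigenvector for λ = 3 indeed satisfies Ax = 3x:

$$\begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix} \begin{pmatrix} 0.5774 \\ 0.8165 \end{pmatrix} = \begin{pmatrix} 0.5774 + 1.1547 \\ 0.8165 + 1.6330 \end{pmatrix} = \begin{pmatrix} 1.7321 \\ 2.4495 \end{pmatrix} = 3 \begin{pmatrix} 0.5774 \\ 0.8165 \end{pmatrix}$$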
Get the PC Scores

The PC scores are obtained by multiplying the original (centered) data matrix by the matrix of eigenvectors; the first column of the product holds the first PC scores and the second column the second PC scores:

$$\underbrace{\begin{pmatrix} -1.264911064 & -1.788854382 \\ -0.632455532 & -0.894427191 \\ 0 & 0 \\ 0.632455532 & 0.894427191 \\ 1.264911064 & 1.788854382 \end{pmatrix}}_{\text{original data}} \underbrace{\begin{pmatrix} 0.577350 & 0.816497 \\ 0.816497 & -0.577350 \end{pmatrix}}_{\text{eigenvectors}} = \underbrace{\begin{pmatrix} -2.19089 & 0 \\ -1.09545 & 0 \\ 0.00000 & 0 \\ 1.09545 & 0 \\ 2.19089 & 0 \end{pmatrix}}_{\text{PC scores}}$$

The original data in a two-dimensional space are reduced to one dimension.
Xuhua Xia Slide 20
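For example (arithmetic spelled out here, not shown on the slide), the first observation's score on PC1 is

$$0.577350 \times (-1.264911) + 0.816497 \times (-1.788854) = -0.730297 - 1.460595 = -2.19089$$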
What Are Principal Components?
• Principal components are a new set of variables,
which are linear combinations of the observed ones,
with these properties:
– Because of the decreasing variance property, much of the
variance (information in the original set of p variables)
tends to be concentrated in the first few PCs. This implies
that we can drop the last few PCs without losing much
information. PCA is therefore considered a dimension-reduction
technique.
– Because PCs are orthogonal, they can be used instead of
the original variables in situations where having
orthogonal variables is desirable (e.g., regression).

Xuhua Xia Slide 21


Index of hidden variables
• The ranking of Asian universities by the Asian Week
– HKU is ranked second in financial resources, but seventh
in academic research
– How did HKU get ranked third?
– Is there a more objective way of ranking?
• An illustrative example:
School Math English Physics Chemistry Chinese
1 60 55 65 64 67
2 70 65 69 71 77
3 80 75 72 85 82
4 90 85 85 88 88
5 100 95 95 95 93
6 …. … … … …
Xuhua Xia Slide 22
A Simple Data Set
School   Math   English
1        60     55
2        70     65
3        80     75
4        90     85
5        100    95
Mean     80.0   75.0
Var      250    250

[Figure: English scores (50–100) plotted against Math scores (50–100); the five schools fall on a straight line.]

• School 5 is clearly the best school
• School 1 is clearly the worst school

Xuhua Xia Slide 23


Graphic PCA

[Figure: graphic PCA of the school data, with the five schools projected onto the first principal component (PC1 scores of about -1.789, -0.894, 0, 0.894, and 1.789).]
Xuhua Xia Slide 24


Crime Data in 50 States

STATE          MURDER  RAPE  ROBBERY  ASSAULT  BURGLARY  LARCENY  AUTO


ALABAMA 14.2 25.2 96.8 278.3 1135.5 1881.9 280.7
ALASKA 10.8 51.6 96.8 284.0 1331.7 3369.8 753.3
ARIZONA 9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
ARKANSAS 8.8 27.6 83.2 203.4 972.6 1862.1 183.4
CALIFORNIA 11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
COLORADO 6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
CONNECTICUT 4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
DELAWARE 6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
FLORIDA 10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
GEORGIA 11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
HAWAII 7.2 25.5 128.0 64.1 1911.5 3920.4 489.4
IDAHO 5.5 19.4 39.6 172.5 1050.8 2599.6 237.6
ILLINOIS 9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
. . . . . . . .
. . . . . . . .
PROC PRINCOMP OUT=CRIMCOMP;
Xuhua Xia Slide 25
DATA CRIME;
TITLE 'CRIME RATES PER 100,000 POP BY STATE';
INPUT STATENAME $1-15 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
CARDS;
Alabama 14.2 25.2 96.8 278.3 1135.5 1881.9 280.7
Alaska 10.8 51.6 96.8 284.0 1331.7 3369.8 753.3
Arizona 9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
Arkansas 8.8 27.6 83.2 203.4 972.6 1862.1 183.4
California 11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
Colorado 6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
Connecticut 4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
Delaware 6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
Florida 10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
Georgia 11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
Hawaii 7.2 25.5 128.0 64.1 1911.5 3920.4 489.4
Idaho 5.5 19.4 39.6 172.5 1050.8 2599.6 237.6
Illinois 9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
Indiana 7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
Iowa 2.3 10.6 41.2 89.8 812.5 2685.1 219.9
Kansas 6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
Kentucky 10.1 19.1 81.1 123.3 872.2 1662.1 245.4
Louisiana 15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
Maine 2.4 13.5 38.7 170.0 1253.1 2350.7 246.9
Maryland 8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
Massachusetts 3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan 9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
Minnesota 2.7 19.5 85.9 85.8 1134.7 2559.3 343.1
Mississippi 14.3 19.6 65.7 189.1 915.6 1239.9 144.4
Missouri 9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
Montana 5.4 16.7 39.2 156.8 804.9 2773.2 309.2
Nebraska 3.9 18.1 64.7 112.7 760.0 2316.1 249.1
Nevada 15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
New Hampshire 3.2 10.7 23.2 76.0 1041.7 2343.9 293.4
New Jersey 5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
New Mexico 8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
New York 10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
North Carolina 10.6 17.0 61.3 318.3 1154.1 2037.8 192.1
North Dakota 0.9 9.0 13.3 43.8 446.1 1843.0 144.7
Ohio 7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
Oklahoma 8.6 29.2 73.8 205.0 1288.2 2228.1 326.8
Oregon 4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
Pennsylvania 5.6 19.0 130.3 128.0 877.5 1624.1 333.2
Rhode Island 3.6 10.5 86.5 201.0 1489.5 2844.1 791.4
South Carolina 11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
South Dakota 2.0 13.5 17.9 155.7 570.5 1704.4 147.5
Tennessee 10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
Texas 13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
Utah 3.5 20.3 68.8 147.3 1171.6 3004.6 334.5
Vermont 1.4 15.9 30.8 101.2 1348.2 2201.0 265.2
Virginia 9.0 23.3 92.1 165.7 986.2 2521.2 226.7
Washington 4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
West Virginia 6.0 13.2 42.2 90.9 597.4 1341.7 163.3
Wisconsin 2.8 12.9 52.2 63.7 846.9 2614.2 220.7
Wyoming 5.4 21.9 39.7 173.9 811.6 2772.2 282.0
;
PROC PRINCOMP out=crimcomp;
run;
PROC PRINT;
  ID STATENAME;
  VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
run;
PROC GPLOT;
  PLOT PRIN2*PRIN1=STATENAME;
  TITLE2 'PLOT OF THE FIRST TWO PRINCIPAL COMPONENTS';
run;
PROC PRINCOMP data=CRIME COV OUT=crimcomp;
run;
PROC PRINT;
  ID STATENAME;
  VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
run;

/* Add to have a map view */
proc sort data=crimcomp out=crimcomp;
  by STATENAME;
run;
proc sort data=maps.us2 out=mymap;
  by STATENAME;
run;
data both;
  merge mymap crimcomp;
  by STATENAME;
run;
proc gmap data=both;
  id _map_geometry_;
  choro PRIN1 PRIN2/levels=15;
  /* choro PRIN1/discrete; */
run;
Correlation Matrix
MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO

MURDER 1.0000 0.6012 0.4837 0.6486 0.3858 0.1019 0.0688


RAPE 0.6012 1.0000 0.5919 0.7403 0.7121 0.6140 0.3489
ROBBERY 0.4837 0.5919 1.0000 0.5571 0.6372 0.4467 0.5907
ASSAULT 0.6486 0.7403 0.5571 1.0000 0.6229 0.4044 0.2758
BURGLARY 0.3858 0.7121 0.6372 0.6229 1.0000 0.7921 0.5580
LARCENY 0.1019 0.6140 0.4467 0.4044 0.7921 1.0000 0.4442
AUTO 0.0688 0.3489 0.5907 0.2758 0.5580 0.4442 1.0000

• If the variables were not correlated, there would be no point in
doing PCA.
• The correlation matrix is symmetric, so we only need to inspect
the upper or the lower triangle.

Xuhua Xia Slide 28


Eigenvalues

Eigenvalue Difference Proportion Cumulative

PRIN1 4.11496 2.87624 0.587851 0.58785


PRIN2 1.23872 0.51291 0.176960 0.76481
PRIN3 0.72582 0.40938 0.103688 0.86850
PRIN4 0.31643 0.05846 0.045205 0.91370
PRIN5 0.25797 0.03593 0.036853 0.95056
PRIN6 0.22204 0.09798 0.031720 0.98228
PRIN7 0.12406 . 0.017722 1.00000

Xuhua Xia Slide 29
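(Reading the table: the first two PCs already account for 76.5% of the total variance, which is why the two-dimensional plot on the later slides retains most of the information in the seven crime variables.)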


Eigenvectors
PRIN1 PRIN2 PRIN3 PRIN4 PRIN5 PRIN6 PRIN7

MURDER 0.3002 -.6291 0.1782 -.2321 0.5381 0.2591 0.2675


RAPE 0.4317 -.1694 -.2441 0.0622 0.1884 -.7732 -.2964
ROBBERY 0.3968 0.0422 0.4958 -.5579 -.5199 -.1143 -.0039
ASSAULT 0.3966 -.3435 -.0695 0.6298 -.5066 0.1723 0.1917
BURGLARY 0.4401 0.2033 -.2098 -.0575 0.1010 0.5359 -.6481
LARCENY 0.3573 0.4023 -.5392 -.2348 0.0300 0.0394 0.6016
AUTO 0.2951 0.5024 0.5683 0.4192 0.3697 -.0572 0.1470
• Do these eigenvectors mean anything?
  – All crimes have positive loadings on the first eigenvector,
which is therefore interpreted as a measure of the overall crime
rate.
  – The 2nd eigenvector has positive loadings on AUTO, LARCENY and
ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is
interpreted as measuring the preponderance of property crime over
violent crime.

Xuhua Xia Slide 30
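Since this PCA was run on the correlation matrix, a state's PC1 score is a loading-weighted sum of its standardized crime rates (formula added here for clarity; x̄j and sj are the mean and standard deviation of the j-th crime variable, and zj the standardized value):

$$\mathrm{PC1} = \sum_{j=1}^{7} v_{j1}\, \frac{x_j - \bar{x}_j}{s_j} = 0.3002\, z_{\mathrm{MURDER}} + 0.4317\, z_{\mathrm{RAPE}} + \cdots + 0.2951\, z_{\mathrm{AUTO}}$$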


PC Plot: Crime Data
[Figure: plot of the first two principal components, with each state plotted by its PC1 score (x-axis, -5 to 7) and PC2 score (y-axis, -3 to 3). Callouts mark Maryland; Nevada, New York, and California at the high end of PC1; North and South Dakota at the low end of PC1; and Mississippi, Alabama, Louisiana, and South Carolina at the bottom of PC2.]
Xuhua Xia Slide 31


Plot of PC1

[Map legend: choropleth of PC1 scores by state in 15 levels, ranging from about -3.96 to 5.27.]

Plot of PC2

[Map legend: choropleth of PC2 scores by state in 15 levels, ranging from about -2.55 to 2.63.]
