1
Principal Components Analysis
• Consider a set of correlated variables x1, ..., xp
• Replace this original set of variables by a set of uncorrelated variables y1, ..., yp
2
PCA and variation
• PCA focuses on describing the variation in the
data set
3
PCA and dimension reduction
• If the original variables are highly correlated, then
one can hope that only a few PCs will account for
most of the variation in the original variables.
4
A first example
We consider a dataset of 49 female sparrows
collected after a storm. The variables measured are
x1 Total length
x2 Alar length
x3 Length of beak and head
x4 Length of humerus
x5 Length of keel of sternum
5
Example: scatterplot matrix
[Figure: scatterplot matrix of the five sparrow variables x1, ..., x5]
6
Example: correlation matrix
x1 x2 x3 x4 x5
x1 1.00 0.73 0.66 0.63 0.61
7
Example: sparrows
The first PC is given by
8
The first PC
• The first PC y1 is a linear combination of x1, ..., xp.
9
The second PC
• The second PC y2 is a linear combination of x1, ..., xp
Σe = λe
12
Some matrix results
• A square matrix Σ of size p contains at most p
distinct eigenvalues
13
Finding PCs
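The principal components are given by
$$y_k = e_k^t x = e_{k1} x_1 + \dots + e_{kp} x_p, \qquad k = 1, \dots, p,$$
with $\operatorname{var}(y_k) = \lambda_k$, where $(\lambda_k, e_k)$, with $e_k = (e_{k1}, \dots, e_{kp})^t$, is the $k$th eigenvalue-eigenvector pair of the variance-covariance matrix $S$.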
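A minimal sketch of how the leading eigenvalue-eigenvector pair could be computed, assuming a small made-up correlation matrix (not the sparrow data). Power iteration is swapped in here as a dependency-free stand-in; in practice a linear algebra library's symmetric eigensolver would normally be used. The names `mat_vec` and `leading_eigenpair` are illustrative only.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def leading_eigenpair(S, iterations=500):
    """Approximate the largest eigenvalue of a symmetric matrix S and a
    unit-length eigenvector for it, using power iteration."""
    p = len(S)
    e = [1.0 / math.sqrt(p)] * p          # arbitrary unit-length start
    for _ in range(iterations):
        w = mat_vec(S, e)
        norm = math.sqrt(sum(x * x for x in w))
        e = [x / norm for x in w]
    # Rayleigh quotient e^t S e: the variance of the linear combination e^t x
    lam = sum(ei * wi for ei, wi in zip(e, mat_vec(S, e)))
    return lam, e

# Hypothetical 3 x 3 correlation matrix, for illustration only
S = [[1.0, 0.6, 0.5],
     [0.6, 1.0, 0.4],
     [0.5, 0.4, 1.0]]

lam1, e1 = leading_eigenpair(S)
print("lambda_1 =", round(lam1, 4))
print("e_1      =", [round(x, 4) for x in e1])
```

The entries of `e1` are the coefficients of the first PC, and `lam1` is its variance. Further PCs would require deflation or a full eigendecomposition, which this sketch does not attempt.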
14
Example: Sparrows
15
Total variance
• The total variance is the sum of the variances of
the components
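One way to see why the PCs carry the same total variance as the original variables, sketched under the assumption that S is written as S = EΛE^t, with E the orthogonal matrix whose columns are the eigenvectors e_k and Λ the diagonal matrix of eigenvalues (this notation is introduced here for illustration):

```latex
\sum_{j=1}^{p} \operatorname{var}(x_j)
  = \operatorname{tr}(S)
  = \operatorname{tr}(E \Lambda E^{t})
  = \operatorname{tr}(\Lambda E^{t} E)
  = \operatorname{tr}(\Lambda)
  = \sum_{j=1}^{p} \lambda_j
  = \sum_{j=1}^{p} \operatorname{var}(y_j)
```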
16
Example: Sparrows
$$\text{Total variance} = \sum_{j=1}^{p} \operatorname{var}(x_j) = 5.00 = \sum_{j=1}^{p} \operatorname{var}(y_j)$$
17
Proportion of the variance explained
• The proportion of the total variance that the kth PC accounts for is
$$p_k = \frac{\lambda_k}{\text{total variance}} = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}$$
18
Example: Sparrows
k    pk = λk / 5           Pk (cumulative)
1    3.58 / 5 = 0.715      0.715 (71.5%)
2    0.54 / 5 = 0.107      0.822 (82.2%)
3    0.38 / 5 = 0.076      0.898 (89.8%)
4    0.33 / 5 = 0.066      0.964 (96.4%)
5    0.18 / 5 = 0.036      1.00 (100%)
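The arithmetic above can be sketched in a few lines. The eigenvalues are the two-decimal values quoted on this slide; because they are rounded, their sum is not exactly 5, so the computed proportions may differ from the slide in the third decimal. The function name `variance_proportions` is an illustrative choice.

```python
from itertools import accumulate

# Sparrow eigenvalues as quoted on the slide (rounded to two decimals)
eigenvalues = [3.58, 0.54, 0.38, 0.33, 0.18]

def variance_proportions(eigenvalues):
    """Return the proportion of total variance for each PC and the
    cumulative proportions."""
    total = sum(eigenvalues)
    proportions = [lam / total for lam in eigenvalues]
    cumulative = list(accumulate(proportions))
    return proportions, cumulative

proportions, cumulative = variance_proportions(eigenvalues)
for k, (lam, p_k, P_k) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"PC{k}: lambda = {lam:.2f}  p_k = {p_k:.3f}  P_k = {P_k:.3f}")
```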
19
Number of PCs
• Trade-off between parsimony (low dimension)
and retaining enough relevant information
20
Number of PCs: criteria
• Retain a large percentage of the total variance
(between 70% and 90%)
21
Screeplot
• Plot the eigenvalues vs the PC number
• This is a decreasing curve
• Usually, there is an elbow point where the large
eigenvalues cease and the small eigenvalues
begin
22
Example: Sparrows
• First PC already represents 71.52%, first two
PCs 82.24% of total variance
→ Retain only 1 PC
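A small sketch of two selection rules mentioned in these slides: keep enough PCs to reach a target share of the total variance, or keep the PCs whose eigenvalue exceeds the average eigenvalue. It assumes the eigenvalues are sorted in decreasing order and uses the rounded sparrow eigenvalues; the function names are hypothetical.

```python
def n_pcs_by_variance(eigenvalues, threshold):
    """Smallest k such that the first k PCs explain at least `threshold`
    (a fraction between 0 and 1) of the total variance."""
    total = sum(eigenvalues)
    running = 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        running += lam
        if running / total >= threshold:
            return k
    return len(eigenvalues)

def n_pcs_above_average(eigenvalues):
    """Number of PCs whose eigenvalue exceeds the average eigenvalue."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for lam in eigenvalues if lam > average)

# Rounded sparrow eigenvalues quoted earlier, in decreasing order
sparrow_eigenvalues = [3.58, 0.54, 0.38, 0.33, 0.18]

print("PCs for 70% of variance:", n_pcs_by_variance(sparrow_eigenvalues, 0.70))
print("PCs for 90% of variance:", n_pcs_by_variance(sparrow_eigenvalues, 0.90))
print("PCs above average eigenvalue:", n_pcs_above_average(sparrow_eigenvalues))
```

The answer depends strongly on the chosen threshold, which is why the percentage rule is usually read together with the screeplot.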
23
Example: screeplot
[Figure: screeplot for the sparrows data; vertical axis: Variances, from 0.0 to 3.5]
24
Standardizing the variables
• Usually, the variables are centered: xj := xj − x̄j
→ The PCs are also centered at zero
• Often the variables are standardized:
$$z_j = \frac{x_j - \bar{x}_j}{\sqrt{\operatorname{var}(x_j)}}, \qquad j = 1, \dots, p$$
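Standardizing a single variable might look like the sketch below. The measurements are made up for illustration, and the sample standard deviation (n − 1 denominator) is an assumption, since the slides do not say which denominator is used.

```python
from statistics import mean, stdev

def standardize(values):
    """Center a variable at zero and scale it to unit sample variance."""
    center = mean(values)
    scale = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - center) / scale for v in values]

# Hypothetical measurements of one variable, for illustration only
lengths = [156.0, 154.0, 153.0, 155.0, 163.0, 157.0]
z = standardize(lengths)
print([round(v, 3) for v in z])
```

Applying this to every variable makes the variance-covariance matrix of the standardized data equal to the correlation matrix of the original data.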
25
Example: sparrows
• Previous results were based on standardized
variables
26
Standardize?
• No: Variables with largest variance dominate first
PCs
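To illustrate the effect of scale, the sketch below compares the first PC direction for two variables before and after standardization. The numbers are hypothetical (variances 100 and 1, correlation 0.3), and the closed-form 2 x 2 eigenvector formula assumes a non-zero off-diagonal entry.

```python
import math

def first_pc_direction(s11, s12, s22):
    """Largest eigenvalue and a unit eigenvector of [[s11, s12], [s12, s22]].
    Assumes s12 != 0."""
    half_diff = (s11 - s22) / 2.0
    lam1 = (s11 + s22) / 2.0 + math.hypot(half_diff, s12)
    # (s12, lam1 - s11) solves (S - lam1 * I) e = 0 when s12 != 0
    e = (s12, lam1 - s11)
    norm = math.hypot(e[0], e[1])
    return lam1, (e[0] / norm, e[1] / norm)

# Hypothetical covariance matrix: variances 100 and 1, covariance 0.3 * 10 * 1 = 3
lam_cov, e_cov = first_pc_direction(100.0, 3.0, 1.0)
# Matching correlation matrix (standardized variables)
lam_cor, e_cor = first_pc_direction(1.0, 0.3, 1.0)

print("unstandardized first PC:", [round(v, 4) for v in e_cov])
print("standardized first PC:  ", [round(v, 4) for v in e_cor])
```

Without standardization the first PC is almost entirely the high-variance variable; after standardization both variables get equal weight.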
27
Example: Air pollution
Measurements of 6 variables at 41 U.S. cities
28
Example: scatterplot matrix
[Figure: scatterplot matrix of the six air pollution variables]
29
Example: correlation matrix
Temp Manuf Pop Wind Precip Days
30
Example: Air pollution
Importance of components:
Comp1 Comp2 Comp3 Comp4 Comp5 Comp6
Loadings:
Comp1 Comp2 Comp3 Comp4 Comp5 Comp6
31
Example: Number of PCs
• Three PCs exceed average eigenvalue (= 1)
• Screeplot?
32
Example: screeplot
[Figure: screeplot for the air pollution data; vertical axis: Variances, from 0.0 to 2.0]
33
Example: Interpretation of PCs
Requires examining and interpreting the loadings
34
Displaying multivariate data: PCA
• Often the main goal of PCA is creating a useful
graphical representation of the data
35
Example: Scatterplot of PC2 vs PC1
[Figure: scatterplot of PC2 (vertical axis) against PC1 (horizontal axis); each city is plotted as its abbreviated name]
36
Example: Scatterplot of PC3 vs PC1
[Figure: scatterplot of PC3 (vertical axis) against PC1 (horizontal axis); each city is plotted as its abbreviated name]
37
Example: Scatterplot of PC3 vs PC2
[Figure: scatterplot of PC3 (vertical axis) against PC2 (horizontal axis); each city is plotted as its abbreviated name]
38
Biplots
• Plot of first PCs
39
Example: Biplot
[Figure: biplot of Comp.2 (vertical axis) against Comp.1 (horizontal axis); cities are plotted as abbreviated names, with the variables Temp, Manuf, Pop, Wind, Precip and Days overlaid]
40
Geometry of PCA
• PCs are a translation and rotation of the
coordinate axes
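The translation-plus-rotation view can be checked numerically in two dimensions. The sketch below uses made-up points, centers them (translation), and rotates the axes by the angle that removes the covariance; the angle formula is the standard closed form for a 2 x 2 symmetric matrix, and the function names are illustrative.

```python
import math

def covariance_2d(points):
    """Mean vector and sample (co)variances of a list of (x1, x2) points."""
    n = len(points)
    m1 = sum(x for x, _ in points) / n
    m2 = sum(y for _, y in points) / n
    s11 = sum((x - m1) ** 2 for x, _ in points) / (n - 1)
    s22 = sum((y - m2) ** 2 for _, y in points) / (n - 1)
    s12 = sum((x - m1) * (y - m2) for x, y in points) / (n - 1)
    return (m1, m2), (s11, s12, s22)

def to_principal_axes(points):
    """Translate to the mean, then rotate so the new axes are the PCs."""
    (m1, m2), (s11, s12, s22) = covariance_2d(points)
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)  # angle of the first PC axis
    c, s = math.cos(theta), math.sin(theta)
    return [(c * (x - m1) + s * (y - m2), -s * (x - m1) + c * (y - m2))
            for x, y in points]

# Hypothetical correlated points, for illustration only
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.4), (4.0, 3.8), (5.0, 5.1), (6.0, 5.7)]
scores = to_principal_axes(points)

_, (v1, c12, v2) = covariance_2d(scores)
print("var(y1) =", round(v1, 4), " var(y2) =", round(v2, 4), " cov(y1, y2) =", round(c12, 10))
```

In the rotated coordinates the covariance vanishes and the total variance is unchanged, which is what a pure rotation of the axes implies.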
41
Geometry of PCA
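[Figure: contour $f(x_1, x_2) = c^2$ in the $(x_1, x_2)$ plane, with the PC axes $y_1 = e_1^t x$ and $y_2 = e_2^t x$ drawn as rotated coordinate axes]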
42