
Principal component analysis (PCA):
Principles, Biplots, and Modern Extensions for Sparse Data

Steffen Unkel
Department of Medical Statistics
University Medical Center Göttingen

Summer term 2017

1/70
Outline

1 Principles of PCA

2 PCA biplots

3 Sparse PCA

2/70
1 Principles of PCA

3/70
Setting the scene

I The basic aim of PCA is to describe variation in a set of
correlated variables x1, x2, . . . , xp in terms of a new set of
uncorrelated variables y1, y2, . . . , yp.

I Each of y1, y2, . . . , yp is a linear combination of the x
variables (e.g. y1 = a11 x1 + a12 x2 + · · · + a1p xp).

I The new variables are derived in decreasing order of
“importance”, in the sense that
I y1 accounts for as much as possible of the variation (variance) in the
original data amongst all linear combinations of x1, x2, . . . , xp.
I Then, y2 is chosen to account for as much as possible of the
remaining variation, subject to being uncorrelated with y1,
and so on.

4/70
Principal components and dimensionality reduction

I The new variables defined by this process, y1, y2, . . . , yp, are
the principal components (PCs).

I The hope is that the first few PCs will account for a
substantial proportion of the variation in the original
variables x1, x2, . . . , xp.

I If so, the first few PCs can be used to provide a
lower-dimensional summary of the data.

I The PCs form an orthogonal coordinate system.

5/70
The Olympic heptathlon data

I In the 1988 Olympics held in Seoul, the heptathlon was won
by one of the stars of women’s athletics in the USA, Jackie
Joyner-Kersee.

I The heptathlon data set in the R package HSAUR3 contains
the results for all 25 competitors in all seven disciplines.

library(HSAUR3)
data(heptathlon)

I We are using PCA with a view to exploring the structure of
these data and assessing how the PCs relate to the scores
assigned by the official scoring system.

6/70
Score all seven events in the same direction
heptathlon[c(14,25),]

## hurdles highjump shot run200m longjump javelin run800m


## Braun (FRG) 13.71 1.83 13.16 24.78 6.12 44.58 142.8
## Launa (PNG) 16.42 1.50 11.78 26.16 4.88 46.38 163.4
## score
## Braun (FRG) 6109
## Launa (PNG) 4566

heptathlon$hurdles <- with(heptathlon, max(hurdles)-hurdles)
heptathlon$run200m <- with(heptathlon, max(run200m)-run200m)
heptathlon$run800m <- with(heptathlon, max(run800m)-run800m)

heptathlon[c(14,25),]

## hurdles highjump shot run200m longjump javelin run800m


## Braun (FRG) 2.71 1.83 13.16 1.83 6.12 44.58 20.61
## Launa (PNG) 0.00 1.50 11.78 0.45 4.88 46.38 0.00
## score
## Braun (FRG) 6109
## Launa (PNG) 4566

7/70
Scatterplot matrix
score <- which(colnames(heptathlon) == "score")
plot(heptathlon[, -score])

[Scatterplot matrix of the seven (recoded) heptathlon events: hurdles, highjump, shot, run200m, longjump, javelin, run800m.]
8/70
Correlation matrix

round(cor(heptathlon[,-score]), 2)

## hurdles highjump shot run200m longjump javelin run800m


## hurdles 1.00 0.81 0.65 0.77 0.91 0.01 0.78
## highjump 0.81 1.00 0.44 0.49 0.78 0.00 0.59
## shot 0.65 0.44 1.00 0.68 0.74 0.27 0.42
## run200m 0.77 0.49 0.68 1.00 0.82 0.33 0.62
## longjump 0.91 0.78 0.74 0.82 1.00 0.07 0.70
## javelin 0.01 0.00 0.27 0.33 0.07 1.00 -0.02
## run800m 0.78 0.59 0.42 0.62 0.70 -0.02 1.00

9/70
Removing the outlier

heptathlon <- heptathlon[-grep("PNG", rownames(heptathlon)), ]
round(cor(heptathlon[,-score]), 2)

## hurdles highjump shot run200m longjump javelin run800m


## hurdles 1.00 0.58 0.77 0.83 0.89 0.33 0.56
## highjump 0.58 1.00 0.46 0.39 0.66 0.35 0.15
## shot 0.77 0.46 1.00 0.67 0.78 0.34 0.41
## run200m 0.83 0.39 0.67 1.00 0.81 0.47 0.57
## longjump 0.89 0.66 0.78 0.81 1.00 0.29 0.52
## javelin 0.33 0.35 0.34 0.47 0.29 1.00 0.26
## run800m 0.56 0.15 0.41 0.57 0.52 0.26 1.00

10/70
Finding the sample principal components

I The first PC of the observations is the linear combination

y1 = a11 x1 + a12 x2 + · · · + a1p xp

whose sample variance is greatest among all such linear
combinations.

I Since the variance of y1 could be increased without limit
simply by increasing the coefficients a11, a12, . . . , a1p, a
restriction must be placed on these coefficients.

I A sensible constraint is to require that the sum of squares of
the coefficients for each PC should take the value one.

11/70
Eigendecomposition of the sample covariance matrix

I Let S be the positive semi-definite covariance matrix of a
mean-centered data matrix X ∈ R^{n×p} with rank(S) = r
(r ≤ p).

I The eigenvalue decomposition (or spectral decomposition) of
S can be written as

S = AΛA⊤ = Σ_{i=1}^{r} λi ai ai⊤,

where Λ = diag(λ1, . . . , λr) is an r × r diagonal matrix
containing the positive eigenvalues of S, λ1 ≥ · · · ≥ λr > 0,
on its main diagonal and A ∈ R^{p×r} is a column-wise
orthonormal matrix whose columns a1, . . . , ar are the
corresponding unit-norm eigenvectors of λ1, . . . , λr.
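
As a numerical illustration (not part of the original slides), the eigendecomposition of S can be computed with R's eigen(); the built-in USArrests data serve here as a stand-in for a generic mean-centered data matrix X.

X <- scale(USArrests, center = TRUE, scale = FALSE)  # mean-centered data matrix
S <- cov(X)                                          # sample covariance matrix
eig <- eigen(S)                                      # spectral decomposition of S
A <- eig$vectors                                     # columns: unit-norm eigenvectors
Lambda <- diag(eig$values)                           # eigenvalues on the diagonal
max(abs(S - A %*% Lambda %*% t(A)))                  # ~0, so S = A Lambda A'
crossprod(A)                                         # ~identity: A is orthonormal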

12/70
PCA via the eigendecomposition

I PCA looks for r vectors aj ∈ R^{p×1} (j = 1, . . . , r) which

maximize aj⊤ S aj

subject to aj⊤ aj = 1 for j = 1, . . . , r and
ai⊤ aj = 0 for i = 1, . . . , j − 1 (j ≥ 2).

I It turns out that yj = X aj is the j-th sample PC with zero
mean and variance λj, where aj is an eigenvector of S
corresponding to its j-th largest eigenvalue λj (j = 1, . . . , r).

I The total variance of the r PCs will equal the total variance
of the original variables so that Σ_{j=1}^{r} λj = tr(S).
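
Continuing the small USArrests sketch from the previous slide (an illustration, not from the original slides), the sample PCs and their variances can be checked directly:

Y <- X %*% A                    # component scores, one column per PC
apply(Y, 2, var)                # equals eig$values, the variances lambda_j
sum(eig$values); sum(diag(S))   # total variance preserved: sum of lambda_j = tr(S)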

13/70
Singular value decomposition of the data matrix

I The sample PCs can also be found using the singular value
decomposition (SVD) of X.

I Expressing X with rank r (r ≤ min{n, p}) by its SVD gives

X = VDA⊤ = Σ_{j=1}^{r} σj vj aj⊤,

where V = (v1, . . . , vr) ∈ R^{n×r} and
A = (a1, . . . , ar) ∈ R^{p×r} are orthonormal matrices such that
V⊤V = A⊤A = Ir, and D ∈ R^{r×r} is a diagonal matrix
with the singular values of X sorted in decreasing order,
σ1 ≥ σ2 ≥ . . . ≥ σr > 0, on its main diagonal.
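
The decomposition can be verified numerically with svd() (a sketch continuing the USArrests example; not part of the original slides):

sv <- svd(X)                          # SVD of the mean-centered data matrix
V <- sv$u; D <- diag(sv$d); A2 <- sv$v
max(abs(X - V %*% D %*% t(A2)))       # ~0, so X = V D A'
crossprod(V); crossprod(A2)           # both ~identity (orthonormal columns)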

14/70
PCA via the SVD

I The matrix A is composed of coefficients or loadings and the
matrix of component scores Y ∈ R^{n×r} is given by Y = VD.

I Since it holds that A⊤A = Ir and
Y⊤Y/(n − 1) = D²/(n − 1), the loadings are orthogonal
and the sample PCs are uncorrelated.

I The variance of the j-th sample PC is σj²/(n − 1), which is
equal to the j-th largest eigenvalue, λj, of S (j = 1, . . . , r).
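
A quick numerical check of these identities (again an illustration based on the USArrests sketch):

n <- nrow(X)
Y2 <- V %*% D                          # scores Y = V D
sv$d^2 / (n - 1)                       # equals the eigenvalues of S
round(crossprod(Y2) / (n - 1), 10)     # diagonal matrix: the PCs are uncorrelated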

15/70
PCA via the SVD

I In practice, the leading k components with k ≪ r usually
account for a substantial proportion

(λ1 + · · · + λk) / tr(S)

of the total variance in the data and the sum in the SVD of
X is therefore truncated after the first k terms.

I If so, PCA comes down to finding a matrix
Y = (y1, . . . , yk) ∈ R^{n×k} of component scores of the n
samples on the k components and a matrix
A = (a1, . . . , ak) ∈ R^{p×k} of coefficients whose j-th column is
the vector of loadings for the j-th component.
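
The truncation amounts to keeping only the first k singular triplets (a sketch building on the USArrests example; k = 2 is an arbitrary illustrative choice):

k <- 2
Xk <- sv$u[, 1:k] %*% diag(sv$d[1:k]) %*% t(sv$v[, 1:k])  # rank-k approximation of X
sum(sv$d[1:k]^2) / sum(sv$d^2)                            # proportion of total variance retained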

16/70
Finding the sample principal components in R

I In R, PCA can be done using the functions princomp() and
prcomp() (both contained in the R package stats).

I The princomp() function carries out PCA via an
eigendecomposition of the sample covariance matrix S.

I When the variables are on very different scales, PCA is
usually carried out on the correlation matrix R.

I The components derived from R are not equal to those derived from S.
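
To see the difference, one can compare the leading loadings from the two analyses (an illustration using the heptathlon data loaded earlier):

princomp(heptathlon[, -score], cor = TRUE)$loadings[, 1]   # PCA of the correlation matrix R
princomp(heptathlon[, -score], cor = FALSE)$loadings[, 1]  # PCA of the covariance matrix S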

17/70
Correlations and covariances of variables and
components
I The covariance of variable i with component j is given by

Cov(xi , yj ) = λj aji .

I The correlation of variable i with component j is therefore

rxi,yj = (√λj aji) / si,

where si is the standard deviation of variable i.

I If the components are extracted from the correlation matrix,
then

rxi,yj = √λj aji.
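
This relation can be checked numerically for the heptathlon data (an illustration, not part of the original slides; the PCA on the correlation matrix is refitted here so that the snippet is self-contained):

pca <- princomp(heptathlon[, -score], cor = TRUE)
pca$sdev[1] * pca$loadings[, 1]              # sqrt(lambda_1) * a_1i
cor(heptathlon[, -score], pca$scores[, 1])   # empirical correlations with the 1st PC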

18/70
PCA using the function princomp()

I For PCA we assume that each of the variables in the n × p
data matrix X has been centered to have mean zero.

I Because the results for the seven heptathlon events are on
different scales we shall extract the PCs from the p × p
correlation matrix R.

heptathlon_pca <- princomp(heptathlon[, -score], cor=TRUE)

I The result is a list containing the coefficients defining each
component, the PC scores, et cetera.

19/70
Coefficients
The coefficients (also called loadings) for the first PC are
obtained as
a1 <- heptathlon_pca$loadings[,1]
a1

## hurdles highjump shot run200m longjump javelin run800m


## -0.4504 -0.3145 -0.4025 -0.4271 -0.4510 -0.2423 -0.3029

a1%*%a1

## [,1]
## [1,] 1

a2 <- heptathlon_pca$loadings[,2]
a1%*%a2

## [,1]
## [1,] 2.22e-16

Each loading vector is unique, up to a sign flip.


20/70
Rescaled coefficients
The loadings can be rescaled so that coefficients for the most
important components are larger than those for less important
components (aj* = √λj aj, for which aj*⊤ aj* = λj).

The rescaled loadings for the 1st PC are calculated as


rescaleda1 <- a1 * heptathlon_pca$sdev[1]
rescaleda1

## hurdles highjump shot run200m longjump javelin run800m


## -0.9365 -0.6540 -0.8369 -0.8881 -0.9377 -0.5038 -0.6298

When the correlation matrix is analyzed, this rescaling leads to
loadings that are the correlations between the 1st PC and the
original variables.
rescaleda1%*%rescaleda1

## [,1]
## [1,] 4.324

21/70
The variance explained by the principal components
I The total variance of the p PCs will equal the total variance
of the original variables, so that

Σ_{j=1}^{p} λj = s1² + s2² + · · · + sp²,

where λj is the variance of the jth PC and sj² is the sample
variance of xj.

I Consequently, the jth PC accounts for a proportion

λj / (λ1 + · · · + λp)

of the total variance, and the first k PCs account for a proportion

(λ1 + · · · + λk) / (λ1 + · · · + λp).
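
These proportions can be computed directly from the component variances (a short illustration; the same numbers appear in the summary() output on the next slide):

lambda <- heptathlon_pca$sdev^2
lambda / sum(lambda)           # proportion of variance explained by each PC
cumsum(lambda) / sum(lambda)   # cumulative proportion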

22/70
The summary() function

summary(heptathlon_pca)

## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## Standard deviation 2.0793 0.9482 0.9109 0.68320 0.54619 0.33745
## Proportion of Variance 0.6177 0.1284 0.1185 0.06668 0.04262 0.01627
## Cumulative Proportion 0.6177 0.7461 0.8646 0.93131 0.97392 0.99019
## Comp.7
## Standard deviation 0.262042
## Proportion of Variance 0.009809
## Cumulative Proportion 1.000000

23/70
Criteria for choosing the number of components

1. Retain the first k components which explain a large
proportion of the total variation, say 70-80%.

2. If the correlation matrix is analyzed, retain only those
components with variances greater than one (see the short check below).

3. Examine a scree plot. This is a plot of the component
variances versus the component number. The idea is to look
for an “elbow” which corresponds to the point after which
the eigenvalues decrease more slowly.

4. Consider whether the component has a sensible and useful
interpretation.
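
For the heptathlon PCA, criteria 1 and 2 can be checked in one line each (an illustration; the thresholds are the rules of thumb listed above):

cumsum(heptathlon_pca$sdev^2) / sum(heptathlon_pca$sdev^2)  # criterion 1: cumulative proportion
which(heptathlon_pca$sdev^2 > 1)                            # criterion 2: eigenvalues > 1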

24/70
Scree plot
plot(heptathlon_pca$sdev^2, xlab="Component number",
ylab="Component variance", type="l")

[Scree plot: component variance against component number for the seven heptathlon components.]

25/70
Principal component scores

PC scores can be obtained either via heptathlon_pca$scores
or using the predict() function.

Scores on the 1st PC

heptathlon_pca$scores[,1]

or

predict(heptathlon_pca)[,1]

26/70
The uncorrelatedness of the PC scores

t(heptathlon_pca$scores)%*%heptathlon_pca$scores/(24)

## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5


## Comp.1 4.324e+00 7.587e-16 -1.850e-16 -7.772e-16 -6.846e-16
## Comp.2 7.587e-16 8.990e-01 2.423e-15 -3.886e-16 -5.551e-16
## Comp.3 -1.850e-16 2.423e-15 8.297e-01 -1.824e-16 -1.746e-16
## Comp.4 -7.772e-16 -3.886e-16 -1.824e-16 4.668e-01 -9.946e-17
## Comp.5 -6.846e-16 -5.551e-16 -1.746e-16 -9.946e-17 2.983e-01
## Comp.6 1.230e-15 -7.517e-17 -1.214e-16 -4.077e-17 5.204e-17
## Comp.7 -1.943e-16 -6.823e-17 7.286e-17 -2.481e-16 9.483e-17
## Comp.6 Comp.7
## Comp.1 1.230e-15 -1.943e-16
## Comp.2 -7.517e-17 -6.823e-17
## Comp.3 -1.214e-16 7.286e-17
## Comp.4 -4.077e-17 -2.481e-16
## Comp.5 5.204e-17 9.483e-17
## Comp.6 1.139e-01 3.955e-16
## Comp.7 3.955e-16 6.867e-02

27/70
The scores assigned to the athletes and the 1st PC
cor(heptathlon$score, heptathlon_pca$scores[,1])

## [1] -0.9931

plot(heptathlon$score, heptathlon_pca$scores[,1])
[Scatterplot of heptathlon$score against the scores on the 1st PC: the official scores and the first component are almost perfectly negatively related.]

28/70
The USArrests data

I We now perform PCA on the USArrests data set, which is
contained in the R package datasets.

I For each of the 50 US states, the data set contains the
number of arrests per 100,000 residents in 1973 for each of
three crimes: Assault, Murder, and Rape.

I We also record UrbanPop, which measures the percentage of
the population in each state living in urban areas.

29/70
The USArrests data

The rows of the data set contain the 50 states in alphabetical
order and the columns contain the four variables.

head(USArrests)

## Murder Assault UrbanPop Rape


## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7

30/70
Examining the USArrests data

apply(USArrests, 2, mean)

## Murder Assault UrbanPop Rape


## 7.788 170.760 65.540 21.232

apply(USArrests, 2, var)

## Murder Assault UrbanPop Rape


## 18.97 6945.17 209.52 87.73

31/70
PCA on a given data matrix

I The princomp() function performs PCA on a covariance
matrix S.

I We can also perform PCA directly on the n × p data matrix
X using the function prcomp().

I We assume that the variables in X have been centered to
have mean zero.

I Instead of performing PCA via an eigendecomposition of the
covariance matrix as in princomp(), the computation in
prcomp() is done by a singular value decomposition of the
(centered and possibly scaled) data matrix.

32/70
PCA using the function prcomp()

I Next, we perform PCA on the USArrests data using the
prcomp() function.

pr.out <- prcomp(USArrests, scale=TRUE)

I The calculation is done by a singular value decomposition of
the centered and scaled data matrix X.

I By default, the prcomp() function centers the variables to
have mean zero.

I By using the option scale=TRUE, we scale the variables to
have standard deviation one.

33/70
The output of prcomp()
names(pr.out)

## [1] "sdev" "rotation" "center" "scale" "x"

pr.out

## Standard deviations:
## [1] 1.5749 0.9949 0.5971 0.4164
##
## Rotation:
## PC1 PC2 PC3 PC4
## Murder -0.5359 0.4182 -0.3412 0.64923
## Assault -0.5832 0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780 0.13388
## Rape -0.5434 -0.1673 0.8178 0.08902

34/70
Principal component scores

I When we matrix-multiply the (centered and scaled) data matrix X by
pr.out$rotation, we obtain the PC scores.

I Alternatively, the prcomp() output already contains them: the 50 × 4
matrix x (accessible as pr.out$x) has the PC score vectors as its columns.

dim(pr.out$x)

## [1] 50 4

I That is, the kth column of x is the kth PC score vector.
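
A quick check (an illustration, not from the original slides) that multiplying the standardized data by the rotation matrix reproduces pr.out$x:

X_std <- scale(USArrests, center = TRUE, scale = TRUE)  # centered and scaled data
max(abs(X_std %*% pr.out$rotation - pr.out$x))          # ~0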

35/70
Proportion of variance explained by the components
summary(pr.out)

## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.57 0.995 0.5971 0.4164
## Proportion of Variance 0.62 0.247 0.0891 0.0434
## Cumulative Proportion 0.62 0.868 0.9566 1.0000

pr.out$sdev

## [1] 1.5749 0.9949 0.5971 0.4164

pr.var <- pr.out$sdev^2
pve <- pr.var/sum(pr.var)
pve
pve

## [1] 0.62006 0.24744 0.08914 0.04336

36/70
Plot of the proportion of variance explained
plot(pve, xlab="Principal Component",
ylab="Proportion of Variance Explained",
ylim=c(0,1),type='b')

[Plot of the proportion of variance explained against the principal component number.]

37/70
Plot of the cumulative proportion of variance explained
plot(cumsum(pve), xlab="Principal Component",
ylab="Cumulative Proportion of Variance Explained",
ylim=c(0,1),type='b')

[Plot of the cumulative proportion of variance explained against the principal component number.]

38/70
2 PCA biplots

39/70
Motivation

I Biplots are a graphical method for simultaneously displaying
the variables and sample units described by a multivariate
data matrix.

I A PCA biplot displays the component scores and the
variable loadings obtained by PCA in two or three
dimensions.

I The computations are based on the singular value
decomposition of the (centered and possibly scaled) data
matrix X.

I Two versions of PCA biplots exist in the literature and
are implemented in software packages.

40/70
Example of the traditional form of a PCA biplot
Figure 1: The Gabriel form of a PCA biplot for aircraft data, with the samples shown as points and the variables SPR, RGF, PLF and SLF shown as arrows in the space of the first two PCs.
41/70
PCA biplot for USArrests data
biplot(pr.out, scale=0)

[Biplot of the first two PCs of the USArrests data: the 50 states are plotted as points and the variables Murder, Assault, Rape and UrbanPop as arrows.]
42/70
The effect of scaling the variables
pr.noscale <- prcomp(USArrests, scale=FALSE)
par(mfrow = c(1,2))
biplot(pr.out, scale=0); biplot(pr.noscale, scale=0)

[Two biplots of the USArrests data side by side: on the left the PCA of the scaled variables (pr.out), on the right the PCA of the unscaled variables (pr.noscale), in which Assault, the variable with by far the largest variance, dominates the first component.]
43/70
Calibrated axes

I The arrows representing the variables can be converted into
calibrated axes analogous to ordinary scatterplots.

I Calibrated axes: the p variables are represented by p
non-orthogonal axes, known as biplot axes.

I The biplot axes are used in precisely the same way as the
Cartesian axes they approximate.

I This will give approximate values that do not in general
agree precisely with those in the data matrix X but
reproduce the entries in the matrix YA⊤.
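
The point that a two-dimensional biplot reproduces YA⊤ rather than X itself can be illustrated numerically (a sketch reusing the USArrests PCA object pr.out from Section 1; not part of the original slides):

X_std <- scale(USArrests)                                # centered and scaled data
X_hat <- pr.out$x[, 1:2] %*% t(pr.out$rotation[, 1:2])   # YA' with k = 2 components
round((X_hat - X_std)[1:3, ], 2)                         # approximation error for the first three states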

44/70
PCA biplot with calibrated axes

Figure 2: PCA biplot with calibrated axes for aircraft data; the variables SPR, RGF, PLF and SLF are shown as calibrated (non-orthogonal) biplot axes.
45/70
The R package BiplotGUI
PCA biplots with calibrated axes can be obtained using the
function PCAbipl() from the R package UBbipl, which is
available from
http://www.wiley.com//legacy/wileychi/gower/material.html.
Alternatively, the BiplotGUI package provides a graphical user
interface (GUI) for the construction of, interaction with, and
manipulation of PCA biplots with calibrated axes in R.

library(BiplotGUI)

Biplots() is the sole function in the BiplotGUI package and
initialises the GUI for a given set of data.

Biplots(USArrests)

46/70
Application to Quality control data
I Throughout the period of a calendar month, a
manufacturing company is monitoring 15 different variables
in a production process.

I In an effort to quantify the overall product quality, this
company devised a quality index value.

I At the end of the month, the means and standard deviations
of the 15 selected variables were somehow transformed into
a single quality index value in the interval [0, 100].

I The index values give no indication of what the causes of a
poor index value could be.

I We perform a PCA on the monthly mean values of the 15
variables for January 2000 to March 2001.

47/70
PCA biplot of the (scaled) quality monitoring data
Figure 3: PCA biplot of the scaled process quality data with a
multidimensional target interpolated.

48/70
PCA biplot with quality regions
Figure 4: PCA biplot of process quality data with a target, smooth
trend line and quality regions (poor, satisfactory, good) added.

49/70
Quality of fit attained with PCA

Table 1: Explained variation by the first four principal components of
the quality control data (cumulative proportion in percent).

1 dimension    2 dimensions    3 dimensions    4 dimensions
   37.8%           59.8%           74.9%           82.7%

50/70
3 Sparse PCA

51/70
Motivation

I A sparse statistical model is one having only a small number
of nonzero parameters.

I In this section, we discuss how PCA can be sparsified.

I That is, we ask how principal components with sparse loadings
can be derived to yield more interpretable solutions.

I Sparse PCA is a natural extension of PCA well-suited to
high-dimensional data (p ≫ n).

52/70
Jeffers’ pitprops data

I Jeffers’ pitprops data is a classical example showing the
difficulty of interpreting principal components.

I The pitprops data is a correlation matrix of 13 physical
measurements made on a sample of 180 pitprops cut from
Corsican pine timber.

library(elasticnet)
data(pitprops)
dim(pitprops)

## [1] 13 13

53/70
The variables in Jeffers’ pitprops data

topdiam Top diameter in inches
length Length in inches
moist Moisture content, % of dry weight
testsg Specific gravity at time of test
ovensg Oven-dry specific gravity
ringtop Number of annual rings at top
ringbut Number of annual rings at bottom
bowmax Maximum bow in inches
bowdist Distance of point of maximum bow from top in inches
whorls Number of knot whorls
clear Length of clear prop from top in inches
knots Average number of knots per whorl
diaknot Average diameter of the knots in inches

54/70
PCA of pitprops data

pitprop.pca <- princomp(covmat = pitprops)


summary(pitprop.pca)

## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## Standard deviation 2.0539 1.5421 1.3705 1.05328 0.9540 0.90300
## Proportion of Variance 0.3245 0.1829 0.1445 0.08534 0.0700 0.06272
## Cumulative Proportion 0.3245 0.5074 0.6519 0.73726 0.8073 0.86999
## Comp.7 Comp.8 Comp.9 Comp.10 Comp.11
## Standard deviation 0.75917 0.66300 0.59387 0.43685 0.22487
## Proportion of Variance 0.04433 0.03381 0.02713 0.01468 0.00389
## Cumulative Proportion 0.91432 0.94813 0.97526 0.98994 0.99383
## Comp.12 Comp.13
## Standard deviation 0.20363 0.196785
## Proportion of Variance 0.00319 0.002979
## Cumulative Proportion 0.99702 1.000000

55/70
Loadings of the first six components

pitprop.pca$loadings[,1:6]

## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6


## topdiam -0.40379 -0.21785 0.20729 0.09121 -0.08263 0.119803
## length -0.40554 -0.18613 0.23504 0.10272 -0.11279 0.162888
## moist -0.12440 -0.54064 -0.14149 -0.07844 0.34977 -0.275901
## testsg -0.17322 -0.45564 -0.35242 -0.05477 0.35576 -0.054017
## ovensg -0.05717 0.17007 -0.48121 -0.04911 0.17610 0.625557
## ringtop -0.28443 0.01420 -0.47526 0.06343 -0.31583 0.052301
## ringbut -0.39984 0.18964 -0.25310 0.06498 -0.21507 0.002658
## bowmax -0.29356 0.18915 0.24305 -0.28554 0.18533 -0.055119
## bowdist -0.35663 -0.01712 0.20764 -0.09672 -0.10611 0.034222
## whorls -0.37892 0.24845 0.11877 0.20504 0.15639 -0.173148
## clear 0.01109 -0.20530 0.07045 -0.80366 -0.34299 0.175312
## knots 0.11508 -0.34317 -0.09200 0.30080 -0.60037 -0.169783
## diaknot 0.11251 -0.30853 0.32611 0.30338 0.07990 0.626307

56/70
Rotation

I A traditional way to simplify loadings is by rotation.

I The method of rotation emerged in factor analysis and was
motivated both by solving the rotational indeterminacy
problem and by facilitating the factors’ interpretation.

I Rotation can be performed either in an orthogonal or an
oblique (non-orthogonal) fashion.

I Several analytic orthogonal and oblique rotation criteria
exist in the literature.

I All criteria attempt to create a loading matrix whose
elements are close to zero or far from zero, with few
intermediate values.

57/70
Rotation

I If A is the loading matrix, then A is post-multiplied by a
matrix T to give rotated loadings B = AT.

I The rotation matrix T is chosen so as to optimize some
simplicity criterion.

I We would also need an algorithm that optimizes the chosen
rotation criterion and finds the “best” T; a small example is
sketched below.

I However, after rotation, either one or both of the properties
possessed by PCA, that is, orthogonality of the loadings and
uncorrelatedness of the component scores, is lost.
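
In R, the relation B = AT can be checked with the base function stats::varimax(), which returns both the rotated loadings and the rotation matrix (a sketch using the pitprops loadings computed earlier; normalize = FALSE keeps the raw varimax criterion):

A <- pitprop.pca$loadings[, 1:6]      # unrotated loadings of the first six PCs
vm <- varimax(A, normalize = FALSE)   # orthogonal rotation maximizing the varimax criterion
B <- A %*% vm$rotmat                  # rotated loadings B = A T
max(abs(B - unclass(vm$loadings)))    # ~0: the returned loadings equal A T
crossprod(vm$rotmat)                  # ~identity: T is orthogonal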

58/70
The Varimax rotation criterion

I Each variable should be either clearly important or clearly
unimportant in a rotated component, with as few cases as
possible of borderline importance.

I Varimax is the most widely used rotation criterion.

I Varimax tends to drive at least some of the loadings in each
component towards zero.

I A component whose loadings are all roughly equal will be
avoided by most standard rotation criteria.

59/70
Gradient projection algorithm

I Problems in multivariate statistics are often concerned with
the optimization of matrix functions of structured
(e.g. orthogonal) matrix unknowns.

I Gradient projection algorithms are natural ways of solving
such optimization problems as they are especially designed
to follow the geometry of the matrix parameters.

I They are based on the classical gradient approach and
modified for analyzing and solving constrained optimization
problems.

I The idea is to follow the steepest descent direction and to
keep the gradient flow “nailed” to the manifold of
permissible matrices.

60/70
Gradient projection algorithm for orthogonal rotation

I Here, the gradient projection algorithm for orthogonal
rotation is used to find the rotation matrix T that minimizes a
criterion f(V) over all orthogonal matrices V.

I Let M be the manifold of all orthogonal matrices.

I Given a current value of V, this algorithm computes the
gradient of f at V and moves α units in the negative
gradient direction from V.

I The result is projected on M.

61/70
The Gradient projection algorithm visualized

Figure 5: Projection on a manifold of permissible matrices: the
current V is moved to V − α ∂f/∂V and then projected back onto
the manifold M of permissible matrices to give the updated V.

62/70
Iterative scheme

I The algorithm proceeds iteratively; it is monotonically
descending and converges from any starting point to a
stationary point.

I At a stationary point of f restricted to M, the Frobenius
norm of the gradient after projection onto the plane tangent
to M at the current value of V is zero.

I The algorithm stops when the norm is less than some
prescribed precision, say 10⁻⁵.

I Once the optimal rotation matrix T has been found, the
rotated loading matrix is obtained as B = AT.
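
A minimal sketch of such a gradient projection iteration, written for the varimax criterion (an illustration under the assumptions above, not the exact implementation in the GPArotation package used on the next slide; the function names are my own):

varimax_criterion <- function(L) {
  QL <- sweep(L^2, 2, colMeans(L^2), "-")    # squared loadings, column-centered
  list(f = -sum(QL^2) / 4,                   # criterion value (to be minimized)
       Gq = -L * QL)                         # gradient with respect to the loadings L
}

gp_orth_rotate <- function(A, maxit = 500, eps = 1e-5) {
  Tm <- diag(ncol(A))                        # start from the identity rotation
  al <- 1
  vg <- varimax_criterion(A %*% Tm)
  for (iter in seq_len(maxit)) {
    G  <- crossprod(A, vg$Gq)                # gradient of f with respect to T
    M  <- crossprod(Tm, G)
    Gp <- G - Tm %*% (M + t(M)) / 2          # projection onto the tangent space at Tm
    if (sqrt(sum(Gp^2)) < eps) break         # stop: projected gradient norm below precision
    al <- 2 * al
    repeat {                                 # step halving keeps the descent monotone
      sv    <- svd(Tm - al * Gp)             # move al units along the negative gradient ...
      Tnew  <- sv$u %*% t(sv$v)              # ... and project back onto the orthogonal manifold
      vgnew <- varimax_criterion(A %*% Tnew)
      if (vgnew$f < vg$f || al < 1e-10) break
      al <- al / 2
    }
    Tm <- Tnew
    vg <- vgnew
  }
  list(Th = Tm, loadings = A %*% Tm)         # rotation matrix T and rotated loadings B = A T
}

Its output should closely agree with GPForth(A, method = "varimax") used on the next slide.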

63/70
Using the Varimax criterion for Jeffers’ pitprops data

library(GPArotation)
A <- pitprop.pca$loadings[,1:6]
B <- GPForth(A, method="varimax")$loadings
B

## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6


## topdiam -0.4732810 -0.093071 0.066309 -0.035053 -0.047369 0.20604
## length -0.4913803 -0.034979 0.053369 -0.039620 -0.047577 0.23409
## moist 0.0003155 -0.713627 0.148739 -0.001023 0.013383 -0.02893
## testsg -0.0154043 -0.681406 -0.170698 0.018187 0.005942 -0.01546
## ovensg 0.0493026 0.003042 -0.807205 -0.018262 0.138311 0.12255
## ringtop -0.2388960 -0.028431 -0.391257 -0.014618 -0.358794 -0.27123
## ringbut -0.3638911 0.092181 -0.259272 0.063757 -0.130426 -0.28490
## bowmax -0.2472791 0.033791 0.113722 -0.127971 0.439568 -0.12301
## bowdist -0.3980574 0.039518 0.086691 -0.138574 0.073745 0.01454
## whorls -0.3446344 0.052830 0.087417 0.339108 0.212771 -0.16254
## clear -0.0185273 0.009334 -0.008107 -0.916145 0.012019 -0.03890
## knots -0.0318758 0.025179 0.177287 -0.011326 -0.765165 0.02324
## diaknot -0.0516283 0.031716 -0.100118 0.047261 -0.029195 0.82952

64/70
Sparse PCA based on the “elastic net”

I The lasso approach in PCA: perform PCA under the extra
constraints Σ_{j=1}^{p} |akj| ≤ t for some tuning parameter t
(k = 1, . . . , p).

I The above-mentioned approach has several limitations.

I The so-called elastic net generalizes the lasso to overcome its
drawbacks.

I Elastic net approach in PCA: formulate PCA as a
regression-type optimization problem; obtain sparse loadings
by integrating a lasso penalty (via the elastic net) into the
regression criterion.

65/70
Sparse PCA (SPCA) criterion based on the “elastic net”

I Optimization problem:

(Â, B̂) = arg min_{A,B} Σ_{i=1}^{n} ||xi − AB⊤xi||² + λ Σ_{j=1}^{k} ||βj||² + Σ_{j=1}^{k} λ1,j ||βj||1

subject to A⊤A = Ik.

I In the SPCA criterion above, A = (α1, . . . , αk) and
B = (β1, . . . , βk) are p × k matrices, and || · ||1 denotes the l1
norm.

I Whereas the same λ is used for all k components, different
λ1,j's are allowed for penalizing the loadings of different
principal components.

66/70
Alternating algorithm to minimize the SPCA criterion
I B given A: For each j, let Yj* = Xαj. Each β̂j in
B̂ = (β̂1, . . . , β̂k) is an elastic net estimate

β̂j = arg min_{βj} ||Yj* − Xβj||² + λ||βj||² + λ1,j ||βj||1.

I A given B: If B is fixed, then we can ignore the penalty
part of the SPCA criterion and only try to minimize

Σ_{i=1}^{n} ||xi − AB⊤xi||² = ||X − XBA⊤||²_F,

subject to A⊤A = Ik. The solution is found via the SVD of

(X⊤X)B = UDV⊤,

and we set Â = UV⊤.
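
The "A given B" step is a reduced-rank Procrustes update and is easy to sketch in R (an illustration; X and B are placeholders for the data matrix and the current loadings, and the "B given A" step would additionally require an elastic net solver such as the one in the elasticnet package):

update_A <- function(X, B) {
  sv <- svd(crossprod(X) %*% B)   # SVD of (X'X)B = U D V'
  sv$u %*% t(sv$v)                # A-hat = U V' (orthonormal columns)
}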


67/70
Some remarks about SPCA

I Empirical evidence suggests that the output of the above
algorithm does not change much as λ is varied.

I Practically, λ is chosen to be a small positive number.

I Usually several combinations of λ1,j are tried to figure out a
good choice of the tuning parameters.

I Hence, we can pick a λ1,j that gives a good compromise
between explained variance and sparsity (variance-sparsity
trade-off).

68/70
Implementation of sparse PCA in R

I Efficient algorithms exist to fit the elastic net approach
in PCA to multivariate data.

I Sparse PCA is implemented by the function spca() in the
R package elasticnet.

?spca

I The function arrayspc() in the R package elasticnet is
specifically designed for the case p ≫ n, as is typically the
case in microarray data.

?arrayspc

69/70
Sparse PCA of Jeffers’ pitprops data
pitprop.spcap <- spca(pitprops,K = 6, type = "Gram", sparse = "penalty",
para=c(0.06,0.16,0.1,0.5,0.5,0.5))
pitprop.spcav <- spca(pitprops,K = 6, type = "Gram", sparse = "varnum",
para = c(7,4,4,1,1,1))
pitprop.spcap$loadings

## PC1 PC2 PC3 PC4 PC5 PC6


## topdiam -0.4774 0.00000 0.00000 0 0 0
## length -0.4759 0.00000 0.00000 0 0 0
## moist 0.0000 0.78471 0.00000 0 0 0
## testsg 0.0000 0.61936 0.00000 0 0 0
## ovensg 0.1766 0.00000 0.64065 0 0 0
## ringtop 0.0000 0.00000 0.58901 0 0 0
## ringbut -0.2505 0.00000 0.49233 0 0 0
## bowmax -0.3440 -0.02100 0.00000 0 0 0
## bowdist -0.4164 0.00000 0.00000 0 0 0
## whorls -0.4000 0.00000 0.00000 0 0 0
## clear 0.0000 0.00000 0.00000 -1 0 0
## knots 0.0000 0.01333 0.00000 0 -1 0
## diaknot 0.0000 0.00000 -0.01557 0 0 1

pitprop.spcap$pev

## [1] 0.28035 0.13966 0.13298 0.07445 0.06802 0.06227


70/70
