
Principal component analysis (PCA):
Principles, Biplots, and Modern Extensions for Sparse Data

Steffen Unkel
Department of Medical Statistics
University Medical Center Göttingen

Summer term 2017

1/70
Outline

1 Principles of PCA

2 PCA biplots

3 Sparse PCA

2/70
1 Principles of PCA

3/70
Setting the scene

I The basic aim of PCA is to describe variation in a set of
correlated variables x1, x2, . . . , xp in terms of a new set of
uncorrelated variables y1, y2, . . . , yp.

I Each of y1, y2, . . . , yp is a linear combination of the x
variables (e.g. y1 = a11 x1 + a12 x2 + · · · + a1p xp).

I The new variables are derived in decreasing order of
“importance”, in the sense that
I y1 accounts for as much as possible of the variation (variance) in the
original data amongst all linear combinations of x1, x2, . . . , xp.
I Then, y2 is chosen to account for as much as possible of the
remaining variation, subject to being uncorrelated with y1,
and so on.

4/70
Principal components and dimensionality reduction

I The new variables defined by this process, y1, y2, . . . , yp, are
the principal components (PCs).

I The hope is that the first few PCs will account for a
substantial proportion of the variation in the original
variables x1, x2, . . . , xp.

I If so, the first few PCs can be used to provide a
lower-dimensional summary of the data.

I The PCs form an orthogonal coordinate system.

5/70
The Olympic heptathlon data

I In the 1988 Olympics held in Seoul, the heptathlon was won
by one of the stars of women’s athletics in the USA, Jackie
Joyner-Kersee.

I The heptathlon data set in the R package HSAUR3 contains
the results for all 25 competitors in all seven disciplines.

library(HSAUR3)
data(heptathlon)

I We are using PCA with a view to exploring the structure of
these data and assessing how the PCs relate to the scores
assigned by the official scoring system.

6/70
Score all seven events in the same direction
heptathlon[c(14,25),]

## hurdles highjump shot run200m longjump javelin run800m


## Braun (FRG) 13.71 1.83 13.16 24.78 6.12 44.58 142.8
## Launa (PNG) 16.42 1.50 11.78 26.16 4.88 46.38 163.4
## score
## Braun (FRG) 6109
## Launa (PNG) 4566

heptathlon$hurdles <- with(heptathlon, max(hurdles)-hurdles)
heptathlon$run200m <- with(heptathlon, max(run200m)-run200m)
heptathlon$run800m <- with(heptathlon, max(run800m)-run800m)

heptathlon[c(14,25),]

## hurdles highjump shot run200m longjump javelin run800m


## Braun (FRG) 2.71 1.83 13.16 1.83 6.12 44.58 20.61
## Launa (PNG) 0.00 1.50 11.78 0.45 4.88 46.38 0.00
## score
## Braun (FRG) 6109
## Launa (PNG) 4566

7/70
Scatterplot matrix
score <- which(colnames(heptathlon) == "score")
plot(heptathlon[, -score])

[Scatterplot matrix of the seven (recoded) heptathlon events: hurdles, highjump, shot, run200m, longjump, javelin, run800m.]
8/70
Correlation matrix

round(cor(heptathlon[,-score]), 2)

## hurdles highjump shot run200m longjump javelin run800m


## hurdles 1.00 0.81 0.65 0.77 0.91 0.01 0.78
## highjump 0.81 1.00 0.44 0.49 0.78 0.00 0.59
## shot 0.65 0.44 1.00 0.68 0.74 0.27 0.42
## run200m 0.77 0.49 0.68 1.00 0.82 0.33 0.62
## longjump 0.91 0.78 0.74 0.82 1.00 0.07 0.70
## javelin 0.01 0.00 0.27 0.33 0.07 1.00 -0.02
## run800m 0.78 0.59 0.42 0.62 0.70 -0.02 1.00

9/70
Removing the outlier

heptathlon <- heptathlon[-grep("PNG", rownames(heptathlon)), ]
round(cor(heptathlon[,-score]), 2)

## hurdles highjump shot run200m longjump javelin run800m


## hurdles 1.00 0.58 0.77 0.83 0.89 0.33 0.56
## highjump 0.58 1.00 0.46 0.39 0.66 0.35 0.15
## shot 0.77 0.46 1.00 0.67 0.78 0.34 0.41
## run200m 0.83 0.39 0.67 1.00 0.81 0.47 0.57
## longjump 0.89 0.66 0.78 0.81 1.00 0.29 0.52
## javelin 0.33 0.35 0.34 0.47 0.29 1.00 0.26
## run800m 0.56 0.15 0.41 0.57 0.52 0.26 1.00

10/70
Finding the sample principal components

I The first PC of the observations is the linear combination

y1 = a11 x1 + a12 x2 + · · · + a1p xp

whose sample variance is greatest among all such linear
combinations.

I Since the variance of y1 could be increased without limit
simply by increasing the coefficients a11, a12, . . . , a1p, a
restriction must be placed on these coefficients.

I A sensible constraint is to require that the sum of squares of
the coefficients for each PC should take the value one.

11/70
Eigendecomposition of the sample covariance matrix

I Let S be the positive semi-definite covariance matrix of a
mean-centered data matrix X ∈ R^{n×p} with rank(S) = r
(r ≤ p).

I The eigenvalue decomposition (or spectral decomposition) of
S can be written as

S = AΛA⊤ = Σ_{i=1}^{r} λi ai ai⊤,

where Λ = diag(λ1, . . . , λr) is an r × r diagonal matrix
containing the positive eigenvalues of S, λ1 ≥ · · · ≥ λr > 0,
on its main diagonal and A ∈ R^{p×r} is a column-wise
orthonormal matrix whose columns a1, . . . , ar are the
corresponding unit-norm eigenvectors of λ1, . . . , λr.
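
As a numerical illustration (not part of the original slides), the eigendecomposition of S can be computed with R's eigen(); the built-in USArrests data serve here as a stand-in for a generic mean-centered data matrix X.

X <- scale(USArrests, center = TRUE, scale = FALSE)  # mean-centered data matrix
S <- cov(X)                                          # sample covariance matrix
eig <- eigen(S)                                      # spectral decomposition of S
A <- eig$vectors                                     # columns: unit-norm eigenvectors
Lambda <- diag(eig$values)                           # eigenvalues on the diagonal
max(abs(S - A %*% Lambda %*% t(A)))                  # ~0, so S = A Lambda A'
crossprod(A)                                         # ~identity: A is orthonormal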

12/70
PCA via the eigendecomposition

I PCA looks for r vectors aj ∈ R^{p×1} (j = 1, . . . , r) which

maximize aj⊤ S aj

subject to aj⊤ aj = 1 for j = 1, . . . , r and
ai⊤ aj = 0 for i = 1, . . . , j − 1 (j ≥ 2).

I It turns out that yj = X aj is the j-th sample PC with zero
mean and variance λj, where aj is an eigenvector of S
corresponding to its j-th largest eigenvalue λj (j = 1, . . . , r).

I The total variance of the r PCs will equal the total variance
of the original variables so that Σ_{j=1}^{r} λj = tr(S).
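
Continuing the small USArrests sketch from the previous slide (an illustration, not from the original slides), the sample PCs and their variances can be checked directly:

Y <- X %*% A                    # component scores, one column per PC
apply(Y, 2, var)                # equals eig$values, the variances lambda_j
sum(eig$values); sum(diag(S))   # total variance preserved: sum of lambda_j = tr(S)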

13/70
Singular value decomposition of the data matrix

I The sample PCs can also be found using the singular value
decomposition (SVD) of X.

I Expressing X with rank r (r ≤ min{n, p}) by its SVD gives

X = VDA⊤ = Σ_{j=1}^{r} σj vj aj⊤,

where V = (v1, . . . , vr) ∈ R^{n×r} and
A = (a1, . . . , ar) ∈ R^{p×r} are orthonormal matrices such that
V⊤V = A⊤A = Ir, and D ∈ R^{r×r} is a diagonal matrix
with the singular values of X sorted in decreasing order,
σ1 ≥ σ2 ≥ . . . ≥ σr > 0, on its main diagonal.
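
The decomposition can be verified numerically with svd() (a sketch continuing the USArrests example; not part of the original slides):

sv <- svd(X)                          # SVD of the mean-centered data matrix
V <- sv$u; D <- diag(sv$d); A2 <- sv$v
max(abs(X - V %*% D %*% t(A2)))       # ~0, so X = V D A'
crossprod(V); crossprod(A2)           # both ~identity (orthonormal columns)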

14/70
PCA via the SVD

I The matrix A is composed of coefficients or loadings and the
matrix of component scores Y ∈ R^{n×r} is given by Y = VD.

I Since it holds that A⊤A = Ir and
Y⊤Y/(n − 1) = D²/(n − 1), the loadings are orthogonal
and the sample PCs are uncorrelated.

I The variance of the j-th sample PC is σj²/(n − 1), which is
equal to the j-th largest eigenvalue, λj, of S (j = 1, . . . , r).
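
A quick numerical check of these identities (again an illustration based on the USArrests sketch):

n <- nrow(X)
Y2 <- V %*% D                          # scores Y = V D
sv$d^2 / (n - 1)                       # equals the eigenvalues of S
round(crossprod(Y2) / (n - 1), 10)     # diagonal matrix: the PCs are uncorrelated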

15/70
PCA via the SVD

I In practice, the leading k components with k ≪ r usually
account for a substantial proportion

(λ1 + · · · + λk) / tr(S)

of the total variance in the data and the sum in the SVD of
X is therefore truncated after the first k terms.

I If so, PCA comes down to finding a matrix
Y = (y1, . . . , yk) ∈ R^{n×k} of component scores of the n
samples on the k components and a matrix
A = (a1, . . . , ak) ∈ R^{p×k} of coefficients whose j-th column is
the vector of loadings for the j-th component.
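
The truncation amounts to keeping only the first k singular triplets (a sketch building on the USArrests example; k = 2 is an arbitrary illustrative choice):

k <- 2
Xk <- sv$u[, 1:k] %*% diag(sv$d[1:k]) %*% t(sv$v[, 1:k])  # rank-k approximation of X
sum(sv$d[1:k]^2) / sum(sv$d^2)                            # proportion of total variance retained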

16/70
Finding the sample principal components in R

I In R, PCA can be done using the functions princomp() and
prcomp() (both contained in the R package stats).

I The princomp() function carries out PCA via an
eigendecomposition of the sample covariance matrix S.

I When the variables are on very different scales, PCA is
usually carried out on the correlation matrix R.

I The components derived from R are not equal to those derived from S.
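
To see the difference, one can compare the leading loadings from the two analyses (an illustration using the heptathlon data loaded earlier):

princomp(heptathlon[, -score], cor = TRUE)$loadings[, 1]   # PCA of the correlation matrix R
princomp(heptathlon[, -score], cor = FALSE)$loadings[, 1]  # PCA of the covariance matrix S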

17/70
Correlations and covariances of variables and
components
I The covariance of variable i with component j is given by

Cov(xi , yj ) = λj aji .

I The correlation of variable i with component j is therefore

rxi,yj = (√λj aji) / si,

where si is the standard deviation of variable i.

I If the components are extracted from the correlation matrix,
then

rxi,yj = √λj aji.
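
This relation can be checked numerically for the heptathlon data (an illustration, not part of the original slides; the PCA on the correlation matrix is refitted here so that the snippet is self-contained):

pca <- princomp(heptathlon[, -score], cor = TRUE)
pca$sdev[1] * pca$loadings[, 1]              # sqrt(lambda_1) * a_1i
cor(heptathlon[, -score], pca$scores[, 1])   # empirical correlations with the 1st PC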

18/70
PCA using the function princomp()

I For PCA we assume that each of the variables in the n × p
data matrix X has been centered to have mean zero.

I Because the results for the seven heptathlon events are on
different scales we shall extract the PCs from the p × p
correlation matrix R.

heptathlon_pca <- princomp(heptathlon[, -score], cor=TRUE)

I The result is a list containing the coefficients defining each
component, the PC scores, et cetera.

19/70
Coefficients
The coefficients (also called loadings) for the first PC are
obtained as
a1 <- heptathlon_pca$loadings[,1]
a1

## hurdles highjump shot run200m longjump javelin run800m


## -0.4504 -0.3145 -0.4025 -0.4271 -0.4510 -0.2423 -0.3029

a1%*%a1

## [,1]
## [1,] 1

a2 <- heptathlon_pca$loadings[,2]
a1%*%a2

## [,1]
## [1,] 2.22e-16

Each loading vector is unique, up to a sign flip.


20/70
Rescaled coefficients
The loadings can be rescaled so that coefficients for the most
important components are larger than those for less important
components (aj* = √λj aj, for which aj*⊤ aj* = λj).

The rescaled loadings for the 1st PC are calculated as


rescaleda1 <- a1 * heptathlon_pca$sdev[1]
rescaleda1

## hurdles highjump shot run200m longjump javelin run800m


## -0.9365 -0.6540 -0.8369 -0.8881 -0.9377 -0.5038 -0.6298

When the correlation matrix is analyzed, this rescaling leads to
loadings that are the correlations between the 1st PC and the
original variables.
rescaleda1%*%rescaleda1

## [,1]
## [1,] 4.324

21/70
The variance explained by the principal components
I The total variance of the p PCs will equal the total variance
of the original variables, so that

Σ_{j=1}^{p} λj = s1² + s2² + · · · + sp²,

where λj is the variance of the jth PC and sj² is the sample
variance of xj.

I Consequently, the jth PC accounts for a proportion

λj / (λ1 + · · · + λp)

of the total variance, and the first k PCs account for a proportion

(λ1 + · · · + λk) / (λ1 + · · · + λp).
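
These proportions can be computed directly from the component variances (a short illustration; the same numbers appear in the summary() output on the next slide):

lambda <- heptathlon_pca$sdev^2
lambda / sum(lambda)           # proportion of variance explained by each PC
cumsum(lambda) / sum(lambda)   # cumulative proportion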

22/70
The summary() function

summary(heptathlon_pca)

## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## Standard deviation 2.0793 0.9482 0.9109 0.68320 0.54619 0.33745
## Proportion of Variance 0.6177 0.1284 0.1185 0.06668 0.04262 0.01627
## Cumulative Proportion 0.6177 0.7461 0.8646 0.93131 0.97392 0.99019
## Comp.7
## Standard deviation 0.262042
## Proportion of Variance 0.009809
## Cumulative Proportion 1.000000

23/70
Criteria for choosing the number of components

1. Retain the first k components which explain a large
proportion of the total variation, say 70-80%.

2. If the correlation matrix is analyzed, retain only those
components with variances greater than one (see the short check below).

3. Examine a scree plot. This is a plot of the component
variances versus the component number. The idea is to look
for an “elbow” which corresponds to the point after which
the eigenvalues decrease more slowly.

4. Consider whether the component has a sensible and useful
interpretation.
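
For the heptathlon PCA, criteria 1 and 2 can be checked in one line each (an illustration; the thresholds are the rules of thumb listed above):

cumsum(heptathlon_pca$sdev^2) / sum(heptathlon_pca$sdev^2)  # criterion 1: cumulative proportion
which(heptathlon_pca$sdev^2 > 1)                            # criterion 2: eigenvalues > 1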

24/70
Scree plot
plot(heptathlon_pca$sdev^2, xlab="Component number",
ylab="Component variance", type="l")

[Scree plot: component variance against component number for the seven heptathlon components.]

25/70
Principal component scores

PC scores can be obtained either via heptathlon_pca$scores
or using the predict() function.

Scores on the 1st PC

heptathlon_pca$scores[,1]

or

predict(heptathlon_pca)[,1]

26/70
The uncorrelatedness of the PC scores

t(heptathlon_pca$scores)%*%heptathlon_pca$scores/(24)

## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5


## Comp.1 4.324e+00 7.587e-16 -1.850e-16 -7.772e-16 -6.846e-16
## Comp.2 7.587e-16 8.990e-01 2.423e-15 -3.886e-16 -5.551e-16
## Comp.3 -1.850e-16 2.423e-15 8.297e-01 -1.824e-16 -1.746e-16
## Comp.4 -7.772e-16 -3.886e-16 -1.824e-16 4.668e-01 -9.946e-17
## Comp.5 -6.846e-16 -5.551e-16 -1.746e-16 -9.946e-17 2.983e-01
## Comp.6 1.230e-15 -7.517e-17 -1.214e-16 -4.077e-17 5.204e-17
## Comp.7 -1.943e-16 -6.823e-17 7.286e-17 -2.481e-16 9.483e-17
## Comp.6 Comp.7
## Comp.1 1.230e-15 -1.943e-16
## Comp.2 -7.517e-17 -6.823e-17
## Comp.3 -1.214e-16 7.286e-17
## Comp.4 -4.077e-17 -2.481e-16
## Comp.5 5.204e-17 9.483e-17
## Comp.6 1.139e-01 3.955e-16
## Comp.7 3.955e-16 6.867e-02

27/70
The scores assigned to the athletes and the 1st PC
cor(heptathlon$score, heptathlon_pca$scores[,1])

## [1] -0.9931

plot(heptathlon$score, heptathlon_pca$scores[,1])
[Scatterplot of heptathlon$score against the scores on the 1st PC: the official scores and the first component are almost perfectly negatively related.]

28/70
The USArrests data

I We now perform PCA on the USArrests data set, which is
contained in the R package datasets.

I For each of the 50 US states, the data set contains the
number of arrests per 100,000 residents in 1973 for each of
three crimes: Assault, Murder, and Rape.

I We also record UrbanPop, which measures the percentage of
the population in each state living in urban areas.

29/70
The USArrests data

The rows of the data set contain the 50 states in alphabetical
order and the columns contain the four variables.

head(USArrests)

## Murder Assault UrbanPop Rape


## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7

30/70
Examining the USArrests data

apply(USArrests, 2, mean)

## Murder Assault UrbanPop Rape


## 7.788 170.760 65.540 21.232

apply(USArrests, 2, var)

## Murder Assault UrbanPop Rape


## 18.97 6945.17 209.52 87.73

31/70
PCA on a given data matrix

I The princomp() function performs PCA on a covariance
matrix S.

I We can also perform PCA directly on the n × p data matrix
X using the function prcomp().

I We assume that the variables in X have been centered to
have mean zero.

I Instead of performing PCA via an eigendecomposition of the
covariance matrix as in princomp(), the computation in
prcomp() is done by a singular value decomposition of the
(centered and possibly scaled) data matrix.

32/70
PCA using the function prcomp()

I Next, we perform PCA on the USArrests data using the
prcomp() function.

pr.out <- prcomp(USArrests, scale=TRUE)

I The calculation is done by a singular value decomposition of
the centered and scaled data matrix X.

I By default, the prcomp() function centers the variables to
have mean zero.

I By using the option scale=TRUE, we scale the variables to
have standard deviation one.

33/70
The output of prcomp()
names(pr.out)

## [1] "sdev" "rotation" "center" "scale" "x"

pr.out

## Standard deviations:
## [1] 1.5749 0.9949 0.5971 0.4164
##
## Rotation:
## PC1 PC2 PC3 PC4
## Murder -0.5359 0.4182 -0.3412 0.64923
## Assault -0.5832 0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780 0.13388
## Rape -0.5434 -0.1673 0.8178 0.08902

34/70
Principal component scores

I When we matrix-multiply the (centered and scaled) data matrix X by
pr.out$rotation, we obtain the PC scores.

I Alternatively, the prcomp() output already contains them: the 50 × 4
matrix x (accessible as pr.out$x) has the PC score vectors as its columns.

dim(pr.out$x)

## [1] 50 4

I That is, the kth column of x is the kth PC score vector.
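
A quick check (an illustration, not from the original slides) that multiplying the standardized data by the rotation matrix reproduces pr.out$x:

X_std <- scale(USArrests, center = TRUE, scale = TRUE)  # centered and scaled data
max(abs(X_std %*% pr.out$rotation - pr.out$x))          # ~0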

35/70
Proportion of variance explained by the components
summary(pr.out)

## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.57 0.995 0.5971 0.4164
## Proportion of Variance 0.62 0.247 0.0891 0.0434
## Cumulative Proportion 0.62 0.868 0.9566 1.0000

pr.out$sdev

## [1] 1.5749 0.9949 0.5971 0.4164

pr.var <- pr.out$sdev^2
pve <- pr.var/sum(pr.var)
pve
pve

## [1] 0.62006 0.24744 0.08914 0.04336

36/70
Plot of the proportion of variance explained
plot(pve, xlab="Principal Component",
ylab="Proportion of Variance Explained",
ylim=c(0,1),type='b')

[Plot of the proportion of variance explained against the principal component number.]

37/70
Plot of the cumulative proportion of variance explained
plot(cumsum(pve), xlab="Principal Component",
ylab="Cumulative Proportion of Variance Explained",
ylim=c(0,1),type='b')

[Plot of the cumulative proportion of variance explained against the principal component number.]

38/70
2 PCA biplots

39/70
Motivation

I Biplots are a graphical method for simultaneously displaying
the variables and sample units described by a multivariate
data matrix.

I A PCA biplot displays the component scores and the
variable loadings obtained by PCA in two or three
dimensions.

I The computations are based on the singular value
decomposition of the (centered and possibly scaled) data
matrix X.

I Two versions of PCA biplots exist in the literature and
are implemented in software packages.

40/70
Example of the traditional form of a PCA biplot
Figure 1: The Gabriel form of a PCA biplot for aircraft data, with the samples shown as points and the variables SPR, RGF, PLF and SLF shown as arrows in the space of the first two PCs.
41/70
PCA biplot for USArrests data
biplot(pr.out, scale=0)

[Biplot of the first two PCs of the USArrests data: the 50 states are plotted as points and the variables Murder, Assault, Rape and UrbanPop as arrows.]
42/70
The effect of scaling the variables
pr.noscale <- prcomp(USArrests, scale=FALSE)
par(mfrow = c(1,2))
biplot(pr.out, scale=0); biplot(pr.noscale, scale=0)

[Two biplots of the USArrests data side by side: on the left the PCA of the scaled variables (pr.out), on the right the PCA of the unscaled variables (pr.noscale), in which Assault, the variable with by far the largest variance, dominates the first component.]
43/70
Calibrated axes

I The arrows representing the variables can be converted into
calibrated axes analogous to ordinary scatterplots.

I Calibrated axes: the p variables are represented by p
non-orthogonal axes, known as biplot axes.

I The biplot axes are used in precisely the same way as the
Cartesian axes they approximate.

I This will give approximate values that do not in general
agree precisely with those in the data matrix X but
reproduce the entries in the matrix YA⊤.
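
The point that a two-dimensional biplot reproduces YA⊤ rather than X itself can be illustrated numerically (a sketch reusing the USArrests PCA object pr.out from Section 1; not part of the original slides):

X_std <- scale(USArrests)                                # centered and scaled data
X_hat <- pr.out$x[, 1:2] %*% t(pr.out$rotation[, 1:2])   # YA' with k = 2 components
round((X_hat - X_std)[1:3, ], 2)                         # approximation error for the first three states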

44/70
PCA biplot with calibrated axes

Figure 2: PCA biplot with calibrated axes for aircraft data; the variables SPR, RGF, PLF and SLF are shown as calibrated (non-orthogonal) biplot axes.
45/70
The R package BiplotGUI
PCA biplots with calibrated axes can be obtained using the
function PCAbipl() from the R package UBbipl, which is
available from
http://www.wiley.com//legacy/wileychi/gower/material.html.
Alternatively, the BiplotGUI package provides a graphical user
interface (GUI) for the construction of, interaction with, and
manipulation of PCA biplots with calibrated axes in R.

library(BiplotGUI)

Biplots() is the sole function in the BiplotGUI package and
initialises the GUI for a given set of data.

Biplots(USArrests)

46/70
Application to Quality control data
I Throughout the period of a calendar month, a
manufacturing company is monitoring 15 different variables
in a production process.

I In an effort to quantify the overall product quality, this
company devised a quality index value.

I At the end of the month, the means and standard deviations
of the 15 selected variables were somehow transformed into
a single quality index value in the interval [0, 100].

I The index values give no indication of what the causes of a
poor index value could be.

I We perform a PCA on the monthly mean values of the 15
variables for January 2000 to March 2001.

47/70
PCA biplot of the (scaled) quality monitoring data
Figure 3: PCA biplot of the scaled process quality data with a
multidimensional target interpolated.

48/70
PCA biplot with quality regions
Figure 4: PCA biplot of process quality data with a target, smooth
trend line and quality regions (poor, satisfactory, good) added.

49/70
Quality of fit attained with PCA

Table 1: Explained variation by the first four principal components of
the quality control data (cumulative proportion in percent).

1 dimension    2 dimensions    3 dimensions    4 dimensions
   37.8%           59.8%           74.9%           82.7%

50/70
3 Sparse PCA

51/70
Motivation

I A sparse statistical model is one having only a small number
of nonzero parameters.

I In this section, we discuss how PCA can be sparsified.

I That is, we ask how principal components with sparse loadings
can be derived to yield more interpretable solutions.

I Sparse PCA is a natural extension of PCA well-suited to
high-dimensional data (p ≫ n).

52/70
Jeffers’ pitprops data

I Jeffers’ pitprops data is a classical example showing the
difficulty of interpreting principal components.

I The pitprops data is a correlation matrix of 13 physical
measurements made on a sample of 180 pitprops cut from
Corsican pine timber.

library(elasticnet)
data(pitprops)
dim(pitprops)

## [1] 13 13

53/70
The variables in Jeffers’ pitprops data

topdiam Top diameter in inches
length Length in inches
moist Moisture content, % of dry weight
testsg Specific gravity at time of test
ovensg Oven-dry specific gravity
ringtop Number of annual rings at top
ringbut Number of annual rings at bottom
bowmax Maximum bow in inches
bowdist Distance of point of maximum bow from top in inches
whorls Number of knot whorls
clear Length of clear prop from top in inches
knots Average number of knots per whorl
diaknot Average diameter of the knots in inches

54/70
PCA of pitprops data

pitprop.pca <- princomp(covmat = pitprops)


summary(pitprop.pca)

## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## Standard deviation 2.0539 1.5421 1.3705 1.05328 0.9540 0.90300
## Proportion of Variance 0.3245 0.1829 0.1445 0.08534 0.0700 0.06272
## Cumulative Proportion 0.3245 0.5074 0.6519 0.73726 0.8073 0.86999
## Comp.7 Comp.8 Comp.9 Comp.10 Comp.11
## Standard deviation 0.75917 0.66300 0.59387 0.43685 0.22487
## Proportion of Variance 0.04433 0.03381 0.02713 0.01468 0.00389
## Cumulative Proportion 0.91432 0.94813 0.97526 0.98994 0.99383
## Comp.12 Comp.13
## Standard deviation 0.20363 0.196785
## Proportion of Variance 0.00319 0.002979
## Cumulative Proportion 0.99702 1.000000

55/70
Loadings of the first six components

pitprop.pca$loadings[,1:6]

## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6


## topdiam -0.40379 -0.21785 0.20729 0.09121 -0.08263 0.119803
## length -0.40554 -0.18613 0.23504 0.10272 -0.11279 0.162888
## moist -0.12440 -0.54064 -0.14149 -0.07844 0.34977 -0.275901
## testsg -0.17322 -0.45564 -0.35242 -0.05477 0.35576 -0.054017
## ovensg -0.05717 0.17007 -0.48121 -0.04911 0.17610 0.625557
## ringtop -0.28443 0.01420 -0.47526 0.06343 -0.31583 0.052301
## ringbut -0.39984 0.18964 -0.25310 0.06498 -0.21507 0.002658
## bowmax -0.29356 0.18915 0.24305 -0.28554 0.18533 -0.055119
## bowdist -0.35663 -0.01712 0.20764 -0.09672 -0.10611 0.034222
## whorls -0.37892 0.24845 0.11877 0.20504 0.15639 -0.173148
## clear 0.01109 -0.20530 0.07045 -0.80366 -0.34299 0.175312
## knots 0.11508 -0.34317 -0.09200 0.30080 -0.60037 -0.169783
## diaknot 0.11251 -0.30853 0.32611 0.30338 0.07990 0.626307

56/70
Rotation

I A traditional way to simplify loadings is by rotation.

I The method of rotation emerged in factor analysis and was
motivated both by solving the rotational indeterminacy
problem and by facilitating the factors’ interpretation.

I Rotation can be performed either in an orthogonal or an
oblique (non-orthogonal) fashion.

I Several analytic orthogonal and oblique rotation criteria
exist in the literature.

I All criteria attempt to create a loading matrix whose
elements are close to zero or far from zero, with few
intermediate values.

57/70
Rotation

I If A is the loading matrix, then A is post-multiplied by a
matrix T to give rotated loadings B = AT.

I The rotation matrix T is chosen so as to optimize some
simplicity criterion.

I We would also need an algorithm that optimizes the chosen
rotation criterion and finds the “best” T; a small example is
sketched below.

I However, after rotation, either one or both of the properties
possessed by PCA, that is, orthogonality of the loadings and
uncorrelatedness of the component scores, is lost.
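
In R, the relation B = AT can be checked with the base function stats::varimax(), which returns both the rotated loadings and the rotation matrix (a sketch using the pitprops loadings computed earlier; normalize = FALSE keeps the raw varimax criterion):

A <- pitprop.pca$loadings[, 1:6]      # unrotated loadings of the first six PCs
vm <- varimax(A, normalize = FALSE)   # orthogonal rotation maximizing the varimax criterion
B <- A %*% vm$rotmat                  # rotated loadings B = A T
max(abs(B - unclass(vm$loadings)))    # ~0: the returned loadings equal A T
crossprod(vm$rotmat)                  # ~identity: T is orthogonal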

58/70
The Varimax rotation criterion

I Each variable should be either clearly important or clearly
unimportant in a rotated component, with as few cases as
possible of borderline importance.

I Varimax is the most widely used rotation criterion.

I Varimax tends to drive at least some of the loadings in each
component towards zero.

I A component whose loadings are all roughly equal will be
avoided by most standard rotation criteria.

59/70
Gradient projection algorithm

I Problems in multivariate statistics are often concerned with
the optimization of matrix functions of structured
(e.g. orthogonal) matrix unknowns.

I Gradient projection algorithms are natural ways of solving
such optimization problems as they are especially designed
to follow the geometry of the matrix parameters.

I They are based on the classical gradient approach and
modified for analyzing and solving constrained optimization
problems.

I The idea is to follow the steepest descent direction and to
keep the gradient flow “nailed” to the manifold of
permissible matrices.

60/70
Gradient projection algorithm for orthogonal rotation

I Here, the gradient projection algorithm for orthogonal
rotation is used to find the rotation matrix T that minimizes a
criterion f(V) over all orthogonal matrices V.

I Let M be the manifold of all orthogonal matrices.

I Given a current value of V, this algorithm computes the
gradient of f at V and moves α units in the negative
gradient direction from V.

I The result is projected on M.

61/70
The Gradient projection algorithm visualized

Figure 5: Projection on a manifold of permissible matrices: the
current V is moved to V − α ∂f/∂V and then projected back onto
the manifold M of permissible matrices to give the updated V.

62/70
Iterative scheme

I The algorithm proceeds iteratively; it is monotonically
descending and converges from any starting point to a
stationary point.

I At a stationary point of f restricted to M, the Frobenius
norm of the gradient after projection onto the plane tangent
to M at the current value of V is zero.

I The algorithm stops when the norm is less than some
prescribed precision, say 10⁻⁵.

I Once the optimal rotation matrix T has been found, the
rotated loading matrix is obtained as B = AT.
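
A minimal sketch of such a gradient projection iteration, written for the varimax criterion (an illustration under the assumptions above, not the exact implementation in the GPArotation package used on the next slide; the function names are my own):

varimax_criterion <- function(L) {
  QL <- sweep(L^2, 2, colMeans(L^2), "-")    # squared loadings, column-centered
  list(f = -sum(QL^2) / 4,                   # criterion value (to be minimized)
       Gq = -L * QL)                         # gradient with respect to the loadings L
}

gp_orth_rotate <- function(A, maxit = 500, eps = 1e-5) {
  Tm <- diag(ncol(A))                        # start from the identity rotation
  al <- 1
  vg <- varimax_criterion(A %*% Tm)
  for (iter in seq_len(maxit)) {
    G  <- crossprod(A, vg$Gq)                # gradient of f with respect to T
    M  <- crossprod(Tm, G)
    Gp <- G - Tm %*% (M + t(M)) / 2          # projection onto the tangent space at Tm
    if (sqrt(sum(Gp^2)) < eps) break         # stop: projected gradient norm below precision
    al <- 2 * al
    repeat {                                 # step halving keeps the descent monotone
      sv    <- svd(Tm - al * Gp)             # move al units along the negative gradient ...
      Tnew  <- sv$u %*% t(sv$v)              # ... and project back onto the orthogonal manifold
      vgnew <- varimax_criterion(A %*% Tnew)
      if (vgnew$f < vg$f || al < 1e-10) break
      al <- al / 2
    }
    Tm <- Tnew
    vg <- vgnew
  }
  list(Th = Tm, loadings = A %*% Tm)         # rotation matrix T and rotated loadings B = A T
}

Its output should closely agree with GPForth(A, method = "varimax") used on the next slide.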

63/70
Using the Varimax criterion for Jeffers’ pitprops data

library(GPArotation)
A <- pitprop.pca$loadings[,1:6]
B <- GPForth(A, method="varimax")$loadings
B

## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6


## topdiam -0.4732810 -0.093071 0.066309 -0.035053 -0.047369 0.20604
## length -0.4913803 -0.034979 0.053369 -0.039620 -0.047577 0.23409
## moist 0.0003155 -0.713627 0.148739 -0.001023 0.013383 -0.02893
## testsg -0.0154043 -0.681406 -0.170698 0.018187 0.005942 -0.01546
## ovensg 0.0493026 0.003042 -0.807205 -0.018262 0.138311 0.12255
## ringtop -0.2388960 -0.028431 -0.391257 -0.014618 -0.358794 -0.27123
## ringbut -0.3638911 0.092181 -0.259272 0.063757 -0.130426 -0.28490
## bowmax -0.2472791 0.033791 0.113722 -0.127971 0.439568 -0.12301
## bowdist -0.3980574 0.039518 0.086691 -0.138574 0.073745 0.01454
## whorls -0.3446344 0.052830 0.087417 0.339108 0.212771 -0.16254
## clear -0.0185273 0.009334 -0.008107 -0.916145 0.012019 -0.03890
## knots -0.0318758 0.025179 0.177287 -0.011326 -0.765165 0.02324
## diaknot -0.0516283 0.031716 -0.100118 0.047261 -0.029195 0.82952

64/70
Sparse PCA based on the “elastic net”

I The lasso approach in PCA: perform PCA under the extra
constraints Σ_{j=1}^{p} |akj| ≤ t for some tuning parameter t
(k = 1, . . . , p).

I The above-mentioned approach has several limitations.

I The so-called elastic net generalizes the lasso to overcome its
drawbacks.

I Elastic net approach in PCA: formulate PCA as a
regression-type optimization problem; obtain sparse loadings
by integrating a lasso penalty (via the elastic net) into the
regression criterion.

65/70
Sparse PCA (SPCA) criterion based on the “elastic net”

I Optimization problem:

(Â, B̂) = arg min_{A,B} Σ_{i=1}^{n} ||xi − AB⊤xi||² + λ Σ_{j=1}^{k} ||βj||² + Σ_{j=1}^{k} λ1,j ||βj||1

subject to A⊤A = Ik.

I In the SPCA criterion above, A = (α1, . . . , αk) and
B = (β1, . . . , βk) are p × k matrices, and || · ||1 denotes the l1
norm.

I Whereas the same λ is used for all k components, different
λ1,j's are allowed for penalizing the loadings of different
principal components.

66/70
Alternating algorithm to minimize the SPCA criterion
I B given A: For each j, let Yj* = Xαj. Each β̂j in
B̂ = (β̂1, . . . , β̂k) is an elastic net estimate

β̂j = arg min_{βj} ||Yj* − Xβj||² + λ||βj||² + λ1,j ||βj||1.

I A given B: If B is fixed, then we can ignore the penalty
part of the SPCA criterion and only try to minimize

Σ_{i=1}^{n} ||xi − AB⊤xi||² = ||X − XBA⊤||²_F,

subject to A⊤A = Ik. The solution is found via the SVD of

(X⊤X)B = UDV⊤,

and we set Â = UV⊤.
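
The "A given B" step is a reduced-rank Procrustes update and is easy to sketch in R (an illustration; X and B are placeholders for the data matrix and the current loadings, and the "B given A" step would additionally require an elastic net solver such as the one in the elasticnet package):

update_A <- function(X, B) {
  sv <- svd(crossprod(X) %*% B)   # SVD of (X'X)B = U D V'
  sv$u %*% t(sv$v)                # A-hat = U V' (orthonormal columns)
}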


67/70
Some remarks about SPCA

I Empirical evidence suggests that the output of the above
algorithm does not change much as λ is varied.

I Practically, λ is chosen to be a small positive number.

I Usually several combinations of λ1,j are tried to figure out a
good choice of the tuning parameters.

I Hence, we can pick a λ1,j that gives a good compromise
between explained variance and sparsity (variance-sparsity
trade-off).

68/70
Implementation of sparse PCA in R

I Efficient algorithms exist to fit the elastic net approach
in PCA to multivariate data.

I Sparse PCA is implemented by the function spca() in the
R package elasticnet.

?spca

I The function arrayspc() in the R package elasticnet is
specifically designed for the case p ≫ n, as is typically the
case in microarray data.

?arrayspc

69/70
Sparse PCA of Jeffers’ pitprops data
pitprop.spcap <- spca(pitprops,K = 6, type = "Gram", sparse = "penalty",
para=c(0.06,0.16,0.1,0.5,0.5,0.5))
pitprop.spcav <- spca(pitprops,K = 6, type = "Gram", sparse = "varnum",
para = c(7,4,4,1,1,1))
pitprop.spcap$loadings

## PC1 PC2 PC3 PC4 PC5 PC6


## topdiam -0.4774 0.00000 0.00000 0 0 0
## length -0.4759 0.00000 0.00000 0 0 0
## moist 0.0000 0.78471 0.00000 0 0 0
## testsg 0.0000 0.61936 0.00000 0 0 0
## ovensg 0.1766 0.00000 0.64065 0 0 0
## ringtop 0.0000 0.00000 0.58901 0 0 0
## ringbut -0.2505 0.00000 0.49233 0 0 0
## bowmax -0.3440 -0.02100 0.00000 0 0 0
## bowdist -0.4164 0.00000 0.00000 0 0 0
## whorls -0.4000 0.00000 0.00000 0 0 0
## clear 0.0000 0.00000 0.00000 -1 0 0
## knots 0.0000 0.01333 0.00000 0 -1 0
## diaknot 0.0000 0.00000 -0.01557 0 0 1

pitprop.spcap$pev

## [1] 0.28035 0.13966 0.13298 0.07445 0.06802 0.06227


70/70
