You are on page 1of 76

Multivariate Statistical Analysis

• 1. Aspects of Multivariate Analysis
• 2. Principal Components
• 3. Factor Analysis
• 4. Discrimination and Classification
• 5. Clustering
Johnson, R.A., Wichern, D.W. (1982): Applied Multivariate Statistical Analysis,
Prentice Hall.

1

1. Aspects of Multivariate Analysis
Multivariate data arise whenever p ≥ 1 variables are recorded. Values of these
variables are observed for n distinct item, individuals, or experimental trials.
We use the notation xij to indicate the particular value of the ith variable that
is observed on the jth item, or trial.
Thus, n measurements on p variables are displayed as p × n random matrix X:
Variable 1:
Variable 2:
..
Variable i:
..
Variable p:

Item 1
x11
x21
..
xi1
..
xp1

Item 2
x12
x22
..
xi2
..
xp2

...
...
...
..
...
..
...

Item j
x1j
x2j
..
xij
..
xpj

...
...
...
..
...
..
...

Item n
x1n
x2n
..
xin
..
xpn
2

Estimating Moments:
Suppose, E(X) = µ and cov(X) = Σ are the population moments. Based on a
sample of size n, these quantities can be estimated by their empirical versions:
Sample Mean:

n

1X
xij ,
xi =
n j=1

i = 1, . . . , p

Sample Variance:
n
X
¡
¢2
1
2
si = sii =
xij − xi ,
n − 1 j=1

i = 1, . . . , p

Sample Covariance:
sik

n
¢¡
¢
1 X¡
=
xij − xi xkj − xk ,
n − 1 j=1

i = 1, . . . , p ,

k = 1, . . . , p .

3

. p .3:8]) age height weight 30 177 77 fvc 553 fev1 460 fevp 83 4 . . . Assume further. that the p × p population correlation matrix ρ is estimated by the sample correlation matrix R with entries rik = √ sik .k . where rii = 1 for all i = 1. > aimu <.read. header=TRUE) > attach(aimu) > options(digits=2) > mean(aimu[ .Summarize ¡ ¢ all elements sik into the p × p sample variance-covariance matrix S = sik i. . p. k = 1. . . . . . siiskk i = 1. p . . . .dat".table("aimu.

44 -0.5 -233 -302 -20.177 fev1 -0.3:8]) age height weight fvc fev1 fevp age 110 -16.8 height -17 45.683 0.7 5817 4192 -86.3:8]) age height weight fvc fev1 fevp age 1.3 > cor(aimu[ .31 -0.309 height -0.15 0.5 34.62 -0.9 16.043 -0.29 0.24 1.5 fevp -21 -1.00 0.29 -0.619 0.384 fevp -0.2 212.83 -0.9 351 275 -1.00 0.5 fev1 -302 275.043 weight 0.18 0.31 0.494 1.239 0.9 -7.6 fvc -233 351.41 1.00 0.83 1.113 fvc -0.6 325 212 -7.000 0.31 -0.000 5 .41 0.9 109.38 1.11 -0.68 0.5 324.00 -0.15 -0.49 0.9 weight 16 34.> cov(aimu[ .0 4192 4347 162.44 0.6 -87 162 41.

Q) = (x1 − y1)2 + (x2 − y2)2 + · · · + (xp − yp)2 . x2. from P to the origin O = (0. if P has p coordinates so that P = (x1. It is often desirable to weight the coordinates. . y2. . . .Distances: Consider the point P = (x1. . . x2) in the plane. 0) is (Pythagoras) q d(O. The straight line (Euclidian) distance. xp). yp) is given by q d(P. The distance between 2 arbitrary points P and Q = (y1. 6 . P ). P ) = x21 + x22 + · · · + x2p . P ) = x21 + x22 . Each coordinate contributes equally to the calculation of the Euclidian distance. d(O. In general. . . the Euclidian distance is q d(O.

2.1).0. 2)). mu=c(0.0. Sigma=matrix(c(9.mvrnorm(30.1] 7 . 0). Suppose we have n pairs of measurements on 2 independent variables x1 and x2: plot(X) 6 > X <. −6 −4 −2 0 2 4 6 X[.Statistical distance should account for differences in variation and correlation.2] 2 4 Variability in x1 direction is much larger than in x2 direction! Values that are a given deviation from the origin in the x1 direction are not as surprising as are values in x2 direction. It seems reasonable to weight an x2 coordinate more heavily than an x1 coordinate of the same value when computing the distance to the origin. 0 −2 −4 −6 X[.

Compute the statistical distance from the standardized coordinates x∗1 x1 =√ s11 and x∗2 x2 =√ s22 as sµ q d(O. If the coordinate variables vary independent of one other. Q) = + . x2) to any fixed point Q = (y1. s11 s22 The extension to more than 2 dimensions is straightforward. the distance from P to Q is s (x1 − y1)2 (x2 − y2)2 d(P. P ) = (x∗1 )2 + (x∗2 )2 = x1 √ s11 ¶2 µ + x2 √ s22 s ¶2 = x22 x21 + . s11 s22 This can be generalized to accommodate the calculation of statistical distance from an arbitrary point P = (x1. y2). 8 .

The statistical distance from P to Q is s (x1 − y1)2 (x2 − y2)2 (xp − yp)2 d(P. . Suppose the x1 measurements are unrelated to the x2 ones. • If s11 = s22 = · · · = spp. and s11 = 4. All points with constant distance 1 satisfy: x41/4 + x22/1 = 1. an Ellipse centered at (0. . xp) and Q = (y1. . Assume again that Q is fixed. Q) = + + ··· + . 0) by d2(O. x2) with x1 = x2 = 0. . . x2) to (0. s22 = 1. y2. 0). . s11 s22 spp 0 −1 −2 x2 1 2 • The distance of P to the origin is obtained by setting y1 = y2 = · · · = yp = 0. x2. yp).Let P = (x1. x1 9 . −2 −1 0 1 2 Consider a set of paired measurements (x1. . the Euclidian distance is appropriate. . P ) = x21/4 + x22/1. We measure the squared distance of an arbitrary P = (x1.

This definition of statistical distance still does not include most of the important cases because of the assumption of independent coordinates. 2)) > plot(X). 3). > X <. 0). the x1 measurements do not vary independently of x2. 2. abline(0. x1 10 .mvrnorm(30.2. The coordinates exhibit a tendency to be large or small together. abline(0. mu=c(0. -1/3) ~ x2 −4 −2 0 θ −6 x2 2 4 6 ~ x1 −6 −4 −2 0 2 4 6 Here. v=0).9. Moreover.9). we can use what we have already introduced! But before. What is a meaningful measure of distance? Actually.9.2. the variability in the x2 directions is larger than in x1 direction. abline(h=0. we only have to rotate the coordinate system through the angle θ and label the rotated axes x ˜1 and x ˜2 . Sigma=matrix(c(1.

4. R being any other point different to P and Q. Q) = 0 if P = Q. Q) ≤ d(P. d(P. Q) = d(Q. d(P. Q). 3. 0) as s d(O. d(P. R) + d(R. d(P. s˜11 s˜22 where s˜ii denotes the sample variance computed with the (rotated) x ˜i measurements. P ) = x ˜21 x ˜22 + . x2) from the origin (0.Now. 2. Alternative measures of distance can be useful. Q) > 0 if P 6= Q. provided they satisfy the properties 1. 11 . we define the distance of a point P = (x1. P ).

Xp)t have p × p population variancecovariance matrix var(X) = Σ. Yp = `tpX = `1pX1 + `2pX2 + · · · + `ppXp 12 . Xp (data reduction). Let a random vector X = (X1. X2. . . .. Denote the eigenvalues of Σ by λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. Consider the arbitrary linear combinations with fixed vectors `i Y1 = `t1X = `11X1 + `21X2 + · · · + `p1Xp Y2 = `t2X = `12X1 + `22X2 + · · · + `p2Xp . .Principle Components (PCA) Now we try to explain the variance-covariance structure through a few linear combinations of the original p variables X1. .. . . . . X2.

which are uncorrelated and whose variances are as large as possible. Y2. i. . which are of unit length.For these var(Yi) = var(`tiX) = `tiΣ`i cov(Yi. 13 . `tk X) = `tiΣ`k We define as principal components those linear combinations Y1. . . Yp. Yk ) = cov(`tiX. Since increasing the length of `i would also increase the we restrict our P variances.e. . search onto vectors `i. j `2ij = `ti`i = 1.

the ith principal component is the linear combination `Ti X that maximizes var(`tiX) subject to `ti`i = 1 and with cov(`tiX. the second principal component is the linear combination `T2 X that maximizes var(`t2X) subject to `t2`2 = 1 and with cov(`t1X. 14 . How to find all these vectors `i ? We will use well known some results from matrix theory. for k < i (uncorrelated with all the previous ones). the first principal component is the linear combination `T1 X that maximizes var(`t1X) subject to `t1`1 = 1. `t2X) = 0 (uncorrelated with the first one). 2. `tk X) = 0. 3.Procedure: 1.

(λ2. is given by Yi = etiX = e1iX1 + e2iX2 + . i = 1. 15 .Result 1: Let var(X) = Σ and let Σ have the eigenvalue-eigenvector pairs (λ1. If some λi are equal. . . . Thus. cov(Yi. (λp. With this choices var(Yi) = etiΣei = λi . . Then the ith principal component. e1). the principal components are uncorrelated and have variances equal to the eigenvalues of Σ. ep). . the choice of the corresponding coefficient vectors ei. . . are not unique. . p. where λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. . Yk ) = etiΣek = 0 . and hence Yi. + epiXp . . e2).

two. 80 to 90%) of the total population variance (for large p) can be attributed to the first one. the proportion of total variance due to (explained by) the kth principal component is 0< λk <1 λ1 + λ2 + · · · + λp If most (e. . .. . or three principal components. 16 . Then σ11 + σ22 + · · · + σpp = p X var(Xi) = λ1 + λ2 + · · · + λp = i=1 p X var(Yi) .g.Result 2: Let Y1 = et1X. i=1 Thus. then these components can replace the original p variables without much loss of information. Y2 = et2X. the total population variance equals the sum of the eigenvalues. Yp = etpX be the principal components. Consequently.

Y2 = et2X. In particular. .. 17 .Xk √ eki λi = √ σkk are the correlation coefficients between the components Yi and the variables Xk . . eik is proportional to the correlation coefficient between Yi and Xk . Result 3: If Y1 = et1X. Yp = etpX are the principal components from the variance-covariance matrix Σ. then ρYi.The magnitude of eik measures the importance of the kth variable to the ith principal component. .

√ These ellipsoids have axes ±c λiei. Suppose X ∼ Np(µ. Σ) having density function µ 1 −p/2 −1/2 f (x|µ. i = 1. . In the two-dimensional case x = (x1. . . p. Then the µ centered ellipsoids of constant density are (x − µ)tΣ−1(x − µ) = c2 . . Σ) = (2π) |Σ| exp − (x − µ)tΣ−1(x − µ) 2 ¶ . 18 .It is informative to consider principal components derived from multivariate normal random variables. x2)t this equals 1 1 − ρ212 "µ x1 − µ1 √ σ11 ¶2 µ + x2 − µ2 √ σ22 ¶2 µ − 2ρ12 x1 − µ1 √ σ11 ¶µ x2 − µ2 √ σ22 ¶# = c2 .

eigen(sigma.Example: Suppose x = (x1.2] [1. symmetric=TRUE).58939 0.96736 0. Σ).96736 19 . 2) > e <.25340 [2. 0)t and µ Σ= giving ρ12 σ11 = 9 σ12 = 9/4 σ21 = 9/4 σ22 = 1 ¶ √ = (9/4)/ 9 · 1 = 3/4. x2)t ∼ N2(µ.25340 -0. e $values [1] 9. 1).matrix(c(9.41061 $vectors [. 9/4.1] [. 2.] -0. The eigen-analysis of Σ results in > sigma <. with µ = (0. 9/4.] -0.

2]^2+e$vectors[1.7847013 > sqrt(e$values[2])*e$vectors[.1] [1] 0.1]/e$vectors[1.1623767 -0.1]^2 [1] 1 > e$vectors[2.9956024 -0.1] [1] -2.2]^2 [1] 1 0 −1 −2 −3 x2 1 # slopes of major & minor axes > e$vectors[2.2619511 > e$vectors[2.6198741 20 .817507 −3 −2 −1 0 x1 1 2 3 # endpoints of of major&minor axes > sqrt(e$values[1])*e$vectors[.2] [1] 0.2]/e$vectors[1.1]^2+e$vectors[1.2] [1] -3.2 3 # check length of eigenvectors > e$vectors[2.

Thus the principal components lie in the directions of the axes of the constant density ellipsoid. . . λ1 λ2 λp 1 2 1 2 1 2 y + y + · · · + yp λ1 1 λ2 2 λp and this equation defines an ellipsoid (since the λi are positive) in a coordinate system with axes y1. c2 = xtΣ−1x = = 1 1 1 t 2 (e1x) + (et2x)2 + · · · + (etpx)2 . . . y2. . . . then the major axes lies in the direction of e1. If λ1 is the largest eigenvalue. e2. . . Set µ = 0 in what follows. yp lying in the directions of e1. 21 . . The remaining minor axes lie in the directions defined by e2. ep. . ep.These results also hold for p ≥ 2. .

. . . X2. where the diagonal standard deviation matrix V 1/2 is defined as  √  σ11 . Xp)t we now calculate the principal components from Z = (Z1. Zp)t.. . V 1/2 =  √ σpp 22 . Zi = √ σii In matrix notation this equals ³ ´−1 Z = V 1/2 (X − µ) .Principal Components obtained from Standardized Variables Instead of using X = (X1.. . Z2. . where Xi − µi . . . .

Principal Components of Z will be obtained from the eigenvalues λi and eigenvectors ei of ρ of X. . Result 4: The ith principal component of the standardized variables Z with var(Z) = ρ is given by ³ Yi = etiZ = eti V 1/2 Moreover. 23 . i=1 Thus. p X i = 1.Zk = eki λi . . . Theses are. the proportion explained by the kth principal component is λk /p and p ρYi. not the same as the ones derived from Σ. . in general. var(Zi) = p . p . p X i=1 ´−1 var(Yi) = (X − µ) .Clearly E(Z) = 0 and var(Z) = (V 1/2)−1Σ(V 1/2)−1 = ρ.

] 0. and 1. 1).75/2 = 87. symmetric=TRUE).1] [. 0)t and µ Σ= 9 9/4 9/4 1 µ ¶ =⇒ ρ= 1 3/4 3/4 1 ¶ . x2)t ∼ N2(µ.Example cont’ed: Let again x = (x1. e $values [1] 1. 2) > e <.75 0.eigen(rho.70711 The total population variance is p = 2. The eigen-analysis of ρ now results in: > rho <.70711 [2. Σ). with µ = (0.] 0.70711 -0.2] [1.70711 0. 24 . 2.matrix(c(1.5% of this variance is already explained by the first principal component. 3/4.25 $vectors [. 3/4.

The important first component has explained 9.707Z2 = 0. Variables should be standardized.707 = 0.967X1 − 0. 25 . if they are measured on very different scales.6% of the total variability and is dominated by X1 (because of its large variance).707 + 0.707X2 .The principal components from ρ are Y1 Y2 X1 X2 = 0.253X1 − 0.707Z1 − 0. When the variables are standardized however.707 3 1 whereas those from Σ have been Y1 = −0.707 = 0.253X2 Y2 = +0.707Z2 = 0.236X1 − 0.707Z1 + 0. = 0.236X1 + 0.589/10 = 95.707X2 3 1 X2 X1 − 0.967X2 . the resulting variables contribute equally to the principal components.

2 Comp.9980 0.3 Comp.Summarizing Sample Variation by Principal Components So far we have dealt with population means µ and variances Σ.011 0.0061 0.707 7.981 0. > > > > > library(mva) attach(aimu) options(digits=2) pca <.3 29.9581 4.4149 1. If we analyze a sample then we have to replace Σ and µ by their empirical versions S and x.443 10.princomp(aimu[ . 3:8]) summary(pca) Importance of components: Comp.30332 Proportion of Variance 0.992 0.084 0.5 Comp.00016 Cumulative Proportion 0. The eigenvalues and eigenvectors are then based on S or R instead of Σ or ρ.6 Standard deviation 96.4 Comp.1 Comp.00000 > pca$center # the means that were subtracted age height weight fvc fev1 fevp 30 177 77 553 460 83 26 .9998 1.9 0.0019 0.9 0.

17 0.00 1.> pca$scale # the scalings applied to each variable age height weight fvc fev1 fevp 1 1 1 1 1 1 > pca$loadings # a matrix whose columns contain the eigenvectors Loadings: Comp.119 -0.4 Comp.109 0.3 Comp.5 Comp.1 Comp.5 Comp.747 0.164 fevp 0.6 SS loadings 1.3 Comp.00 27 .67 0.17 Cumulative Var 0.741 -0.133 fev1 -0.110 height 0.83 1.246 0.00 Proportion Var 0.624 0.960 weight 0.17 0.00 1.17 0.976 Comp.613 -0.00 1.4 Comp.00 1.50 0.641 0.17 0.212 0.1 Comp.17 0.745 -0.645 0.763 -0.2 Comp.33 0.00 1.6 age -0.17 0.251 fvc -0.2 Comp.

059 Comp.64 -23. 1:2]) > identify(qqnorm(pca$scores[.409 -5.60 52.82 Comp.716 0.1 Comp.6386 0.255 9.0408 -2. 1])).372 -0.06 -147.68 7.8199 9.4171 > plot(pca) # or screeplot(pca) > plot(pca$scores[ .2862 5.84 12. 2])) 28 .> pca$scores 1 2 3 : 78 79 # values of the p principal components for each observation Comp.908 0.130 -0.5 Comp.834 -0.3 22.14 159. identify(qqnorm(pca$scores[.169 11.951 1.42 -82.131 14.009 0.87 -2.2 Comp.6 -1.633 -5.998 4.4 13.40 -6.068 3.

5 −200 −100 0 100 200 Comp. 29 .3 Comp.2 6000 4000 0 2000 Variances 8000 pca Comp.1 Comp.1 Observations 57 and 25 are a bit outside the ellipsoid.50 −50 0 Comp.

Normal Q−Q Plot 0 −50 Sample Quantiles 0 −100 −200 Sample Quantiles 100 50 200 Normal Q−Q Plot 25 57 −2 −1 0 1 Theoretical Quantiles 2 −2 −1 0 1 2 Theoretical Quantiles 30 .

6 Standard deviation 1.86 0.685 0.91 0.3 Comp.584 0.0000 > pca$center age height weight 30 177 77 fvc 553 fev1 460 fevp 83 > pca$scale age height weight 10.4 Comp.4 fvc 75. 3:8]. cor=TRUE) > summary(pca) Importance of components: Comp.078 0.25 0.057 0.69 1.4 31 . we get > pca <.1 Comp.47 0.23 0.princomp(aimu[ .47 0.4 6.7 10.0800 Proportion of Variance 0.14 0.5 Comp.999 1.0011 Cumulative Proportion 0.73 0.2 Comp.If we base the analysis on the sample correlation matrix.8 fev1 65.5 fevp 6.942 0.

110 0.3 Comp.375 Comp.824 -0.2 Comp.316 fvc -0.00 1.633 0.1 age 0.534 fev1 -0.83 1.17 Cumulative Var 0.211 -0.> pca$loadings Loadings: Comp.149 -0.17 0.535 0.446 0.00 32 .67 0.5 Comp.6 -0.373 -0.50 0.650 0.217 0.00 1.6 SS loadings 1.540 fevp Comp.4 Comp.449 0.00 1.00 Proportion Var 0.17 0.270 0.278 0.17 0.5 Comp.643 0.1 Comp.00 1.2 Comp.33 0.17 0.17 0.635 0.00 1.3 Comp.172 -0.402 -0.264 height -0.17 0.168 -0.4 Comp.411 -0.497 weight -0.207 0.674 0.494 -0.541 -0.

33 .1 Apart from observations 57 and 25 the plot appears to be reasonable elliptical.5 Variances 2.5 1.2 1.0 0.1 −1 0 Comp.3 Comp.1 Comp.5 −4 −2 0 2 Comp.5 pca Comp.0 2 2.0 −2 0.

All variables within a group are highly correlated among themselves but have small correlations with variables in a different group. correlations from the group of test scores in French. English. Factor analysis can be considered as an extension of principal component analysis.g. E.. that is responsible for the correlations. 34 . Suppose variables can be grouped by their correlations. random quantities called factors. Mathematics suggest an underlying intelligence factor. It is conceivable that each such group represents a single underlying construct (factor). Both attempt to approximate the covariance matrix Σ. A second group of variables representing physical fitness scores might correspond to another factor.Factor Analysis Purpose of this (controversial) technique is to describe (if possible) the covariance relationships among many variables in terms of a few underlying but unobservable.

. X − µ} = |{z} | {z (p×1) (p×m) (m×1) (p×1) 35 . The factor model postulates that X linearly depend on some unobservable random variables F1. . ²2. called common factors and p additional sources of variation ²1. Fm.. . ²p.. F2. . . called errors or sometimes specific factors. .The Orthogonal Factor Model The p × 1 random vector X has mean µ and covariance matrix Σ. . Xp − µp = `p1F1 + `p2F2 + · · · + `pmFm + ²p or in matrix notation L |{z} F + |{z} ² . . . The factor analysis model is X1 − µ1 = `11F1 + `12F2 + · · · + `1mFm + ²1 X2 − µ2 = `21F1 + `22F2 + · · · + `2mFm + ²2 .

) There are too many unobservable quantities in the model. that the p deviations Xi − µi are expressed in terms of p + m random variables F1. . . cov(F ) = E(F F t) = I   ψ1 0 . which are all unobservable. ψp and F and ² are independent. We assume that E(F ) = 0 . so cov(². . . Notice. 0 0 . (This distinguishes the factor model from a regression model. . so L is the matrix of factor loadings. 0  . .The coefficient `ij is called loading of the ith variable on the jth factor. 36 . Hence we need further assumptions about F and ². . . ²p. . where the explanatory variables Fj can be observed. . Fm and ²1. E(²) = 0 . . F ) = E(²F t) = 0 . 0 cov(²) = E(²²t) = ψ =  0 ψ2 . . . .

This defines the orthogonal factor model and implies a covariance structure for
X. Because of
(X − µ)(X − µ)t = (LF + ²)(LF + ²)t
= (LF + ²)((LF )t + ²t)
= LF (LF )t + ²(LF )t + (LF )²t + ²²t
we have
¡

t

¢

Σ = cov(X) = E (X − µ)(X − µ)
¡
¢ t
¡ t¢ t
¡ t¢
¡ t¢
t
= L E F F L + E ²F L + L E F ² + E ²²
= LLt + ψ .
Since (X − µ)F t = (LF + ²)F t = LF F t + ²F t we further get
¡
¢
¡
¢
t
t
t
cov(X, F ) = E (X − µ)F = E LF F + ²F = L E(F F t) + E(²F t) = L .
37

That proportion of var(Xi) = σii contributed by the m common factors is called
the ith communality h2i . The proportion of var(Xi) due to the specific factor is
called the uniqueness, or specific variance. I.e.,
var(Xi) = communality + specific variance
σii = `2i1 + `2i2 + · · · + `2im + ψi .
With h2i = `2i1 + `2i2 + · · · + `2im we get
2
σii
= h2i + ψi .

The factor model assumes that the p(p + 1)/2 variances and covariances of X
can be reproduced by the pm factor loadings `ij and the p specific variances
ψi. For p = m, the matrix Σ can be reproduced exactly as LLt, so ψ is the
zero matrix. If m is small relative to p, then the factor model provides a simple
explanation of Σ with fewer parameters.
38

Drawbacks:
• Most covariance matrices can not be factored as LLt + ψ, where m << p.
• There is some inherent ambiguity associated with the factor model: let T be
any m × m orthogonal matrix so that T T t = T T = I. then we can rewrite the
factor model as
X − µ = LF + ² = LT T tF + ² = L∗F ∗ + ² .
Since with L∗ = LT and F ∗ = T F we also have
E(F ∗) = T E(F ) = 0 ,

and

cov(F ∗) = T t cov(F )T = T tT = I ,

it is impossible to distinguish the loadings in L from those in L∗. The factors F
and F ∗ have the same statistical properties.
39

. the specific variances play the dominant role. whereas the major aim of factor analysis is to determine a few important common factors. xn on X. factor analysis seeks to answer the question: Does the factor model with a smaller number of factors adequately represent the data? The sample covariance matrix S is an estimator of the unknown population covariance matrix Σ. In these cases. If S deviate from a diagonal matrix then the initial problem is to estimate the factor loadings L and specific variances ψ. Both of these solutions can be rotated afterwards in order to simplify the interpretation of the factors. . . If the off-diagonal elements of S are small. Two methods are very popular: the principal component method and the maximum likelihood method. . the variables are not related and a factor analysis model will not prove useful. 40 .Methods of Estimation With observations x1. x2.

Σ = LLt + 0 = LLt . . . if the last eigenvalues are relatively small we neglect the contributions of λm+1em+1etm+1 + λm+2em+2etm+2 + · · · + λpepetp to Σ above. Thus we define p p p L = ( λ1e1. . . λpep) t to get a factor analysis model with as many factors as variables (m = p) and specific variances ψi = 0 for all i i. This is not very useful. Then Σ = λ1e1et1 + λ2e2et2 + · · · + λpepetp . however.e. ei) with λ1 ≥ λ2 ≥ · · · ≥ λp. 41 . λ2e2.The Principal Component Approach: Let Σ have eigenvalue-eigenvector pairs (λi.

If specific factors are included in the model. This representation assumes that the specific factors ² are of minor importance. where L is now a (m × p) matrix of coefficients as required.This gives us the approximation Σ ≈ λ1e1et1 + λ2e2et2 + · · · + λmemetm = LLt . where ψi = 2 σii − P 2 j `ij . 42 . their variances may be taken to be the diagonal elements of Σ − LLt and the approximation becomes Σ ≈ LLt + ψ .

Applying the above technique onto S or R is known as the principal component solution. it is customary first to center the observations (this does not change the sample covariance structure) and to consider xj − x. P ˆ2 ˆ By the definition of ψi = sii − j `ij . t ˆ ˆ ˆ However. If the units of the variables are not of the same√size then it is desirable to work with the standardizes variables zij = (xij − xi)/ sii having sample variance R. the diagonal elements of S are equal to the diagonal elements of LL + ψ. where ˆ`i are the eigenvectors of S (or t ˆ ˆ ˆ R).To apply this approach to data. 43 . the off-diagonal elements of S are not usually reproduced by LL + ψ.

the contributions of the first few factors to the sample variance should be large. Ideally.• How to determine the number of factors. m? Consider the residual matrix of a m factor model ¡ t ˆ ˆ ˆ S − LL + ψ ¢ with zero diagonal elements. The contribution to the sample variance sii from the first common factor is `ˆ2i1. s11 + s22 + · · · + spp. The contribution to the total sample variance. If the other elements are also small we will take the m factor model to be appropriate. ˆ 1e ˆ 1e ˆ1 ˆ1 = λ `ˆ2i1 = λ λ i=1 44 . from the first common factor is p X µq ¶t µq ¶ ˆ1 .

or ˆj λ p for a factor analysis of R. (Be careful when using these rules blindly!) 45 .In general. the proportion of total sample variance due to the jth factor is ˆj λ s11 + s22 + · · · + spp for a factor analysis of S. or equal m to the number of positive eigenvalues of S. Software packages sometimes set m equal to the number of eigenvalues of R largen than 1 (if the correlation matrix is factored).

0.79.0.0.20449022 0.50. Taste Good buy for money Flavor Suitable for snack Provides lots of energy 1.0. 0.00.96.42 0.0.71 0. 0.0.11.11.71.0.00 0.02. The correlation matrix of the responses was calculated.02.42.11 0.96.0.1.00.00).13 0.50 0.79 1.01 0. 5) > eigen(R) $values [1] 2.85 0.13.50 1.matrix(c(1. a number of customers were asked to rate several attributes of a new product.01.79 0.96 0.0.96 0.0.Example: In a consumer-preference study.11 0.0.13 1.71 0.00 0.1.85.13.02 0.03367744 46 .0.02 1.0.85 0.0.85309042 1.1.1.00 0.00.50.00 0.0.85.0. 0.79.71.10240947 0. 5.80633245 0.01 0.00 > library(mva) > R <.00. 0.42 0.42.01.

55650828 0.] [.5559769 -0.1386643 0.853 + 1.5682357 0.] [3. and specific variances (uniquenesses).806 = 0.1170037 -0.708716714 0.07806457 -0.74256408 -0.40418799 -0.4601593 -0. 47 .5] 0.001656352 0.3] [. Thus we decide to set m = 2.60721643 0.4] [. Hence we directly calculate those quantities.1] [.3820572 0. communalities.] [4.] [2. These two will account for 2.09848524 0.$vectors [1.2] [.39003172 0. There is no special function available in R allowing to get the estimated factor loadings.16840896 0.60158211 -0.071674637 0.93 5 of the total (standardized) sample variance.7513990 0.009012569 The first 2 eigenvalues of R are the only ones being larger than 1.2821170 0.4725608 -0.701783012 0.] [5.3314539 0.22053713 0.

diag(R) . 2) # factor loadings > for (j in 1:2) L[ .0012 [4.075 -0.] 0.543 > h2 <.2] [.020 0.524 [3.976 0.0205 -0.0205 0.5] [1.645 0.0126 0.] 0. 10).075 -0.017 0.0241 0.0117 -0.0552 [3.] -0.000 0. h2 # communalities [1] 0.matrix(rep(0.0000 0.0064 [2.(L %*% t(L) + diag(psi)) # residuals [.] 0.893 0.h2.0012 -0.1] [.] -0.748 [4.798 -0.0166 [5.0064 -0.020 0. 5.939 -0.] 0.0000 48 .] 0.0117 0. psi # specific variances [1] 0.816 [2.diag(L %*% t(L)).1] [.] 0.013 -0.4] [.0201 -0.j] <.2] [1.560 0.0000 -0.000 -0.3] [.028 0.] 0.0678 > R .1071 0.932 > psi <.1211 0.105 [5.j] [.0276 0.777 -0.979 0.879 0.> L <.] 0.sqrt(eigen(R)$values[j]) * eigen(R)$vectors[ .055 0.

If ρ = LLt + ψ is correctly specified. the 1 replaced by h2i the resulting matrix is ρ − ψ = LLt.Thus we would judge a 2-factor model providing a good fit to the data. as well as the communality portions of the diagonal elements ρii = 1 = h2i + ψi . If the specific factor contribution ψi is removed from the diagonal or. 49 . equivalently. then the m common factors should account for the offdiagonal elements of ρ. A Modified Approach – The Principle Factor Analysis We describe this procedure in terms of a factor analysis of R. The large communalities indicate that this model accounts for a large percentage of the sample variance of each variable.

As initial choice of h∗2 i −1 ∗ ii ii you can use 1 − ψi = 1 − 1/r . . which is now factored as Rr ≈ L∗r L∗t r . 50 . where r is the ith diagonal element of R . The principle factor method of factor analysis employs the estimates ·q L∗r = ¸ q q ∗ ∗ ∗ ˆ ∗e ˆ ∗e ˆ∗ e ˆ ˆ ˆ λ . j=1 ∗ ˆ ∗. . Re-estimate i the communalities again and continue till convergence. Then we replace the ith diagonal ∗ element of R by h∗2 i = 1 − ψi . and obtain the reduced correlation matrix Rr . . λ m m 1 1 2 2 and ψˆi∗ = 1 − m X `∗2 ij . .Suppose initial estimates ψi∗ are available. λ . e ˆ where (λ i ) are the (largest) eigenvalue-eigenvector pairs from Rr .

93 0.93 51 .star L.star <.sqrt(eigen(R. [1] 0. 5.h2.r)$values[j]) * eigen(R.j] <.star <. diag(R.83 0.j] <.r)$vectors[ .R.74 0.star # communalities [1] 0.diag(L.j] h2.star)).1/diag(solve(R)).R.r <.matrix(rep(0.star <.95 0.96 0.r <.star # communalities [1] 0.97 0.83 0.94 0.diag(L.83 > > > > h2 # initial guess R.80 0. 10).r) <.Example cont’ed: > h2 <.88 > # apply 3 times to get convergence > > > > R.j] h2.r) <.h2 L.r)$vectors[ .star[ .77 0.star <.star %*% t(L.sqrt(eigen(R. h2.1 . 5. diag(R. h2.star)).95 0.matrix(rep(0. 2) # factor loadings for (j in 1:2) L.r)$values[j]) * eigen(R.star %*% t(L.star[ . 2) # factor loadings for (j in 1:2) L. 10).76 0.

star # loadings [.90 0.167 0. which is strongly recommended and shortly discussed now.star # specific variances [1] 0.] -0.039 0.] -0.> L.71 [4. The only estimating procedure available in R is the maximum likelihood method.15 [5.069 The principle components method for R can be regarded as a principal factor method with initial communality estimates of unity (or specific variance estimates equal to zero) and without iterating.60 -0.h2.77 0.231 0.78 [2.] -0. Beside the PCA method this is the only one.58 > 1 .71 0.032 0.51 [3. 52 .68 -0.1] [.2] [1.] -0.] -0.

This strategy is the only one which is implemented in R and is now applied onto our example.168 0.237 0. covmat = R. factors=2) Call: factanal(factors = 2.Maximum Likelihood Method We now assume that the common factors F and the specific factors ² are from a normal distribution.040 0. Example cont’ed: > factanal(covmat = R. Then maximum likelihood estimates of the unknown factor loadings L and the specific variances ψ may be obtained.052 Loadings: 53 . rotation = "none") Uniquenesses: [1] 0.028 0.

738 0.] [4.] [2.963 Factor1 Factor2 SS loadings 2.976 -0.[1.45 Cumulative Var 0.] Factor1 Factor2 0.979 0.] [3.146 0.] [5.150 0.23 Proportion Var 0.90 54 .535 0.45 0.139 0.24 2.860 0.45 0.

019 0.] 0.524 0.979 [4.] [3.2] [1.336 [5.816 0.843 0.987 -0.811 0.798 -0.131 55 .560 0.013 [3.] 0. `ˆi2).] -0.] 0.124 [3.983 [4.645 0.Factor Rotation Since the original factor loadings are (a) not unique. Example cont’ed: Estimates of the factor loadings from the principal component approach were: > L [1.] 0.] [5.017 > promax(L) [.] 0.965 -0.] 0.939 -0.130 0.1] [.989 [2. yields p points.748 0.2] 0.1] [.1] [.543 > varimax(L) [.] 0. These points can be rotated by using either the varimax or the promax criterion.937 -0. and (b) usually not interpretable. each point corresponding to a variable. We concentrate on graphical methods for m = 2.] [.777 -0.958 -0.427 [5. A plot of the pairs of factor loadings (`ˆi1. we rotate them until a simple structure is achieved.] 0.] [2.105 0.] 0.2] [1.] [4.093 1.007 [2.021 0.

5 0.0 l2 Snack −0. 4 (Snack).0 0.5 1. small loadings on factor 1).0 1.0 −0.5 Snack 0.5 Flavor Snack Good buy Energie −0.5 0.0 1.5 0. while variables 1 (Taste) and 3 (Flavor) define factor 2 (high loadings on factor 2.0 l2 0.5 l1 1.5 l1 Energie 1.0 Taste Flavor Taste Flavor Taste 0. 56 .0 l2 0.1.5 −0.0 0. small loadings on factor 2). and 5 (Energy) define factor 1 (high loadings on factor 1.0 l1 After rotation it’s much clearer to see that variables 2 (Good buy).5 0.5 0.0 0.5 Good buy Good buy Energie −0.0 −0. Johnson & Wichern call factor 1 a nutrition factor and factor 2 a taste factor.

the estimated values of the common factors.. may also be required (e... for diagnostic purposes). called factor scores. These scores are not estimates of unknown parameters in the usual sense. Two methods are provided in factanal( .g. scores = ): the regression method of Thomson. However. Both these methods allows us to plot n such p-dimensional observations as n m-dimensional scores.Factor Scores In factor analysis. interest is usually centered on the parameters in the factor model. 57 .. and the weighted least squares method of Bartlett. They are rather estimates of values for the unobserved random factor vectors.

993 58 . • apply a varimax rotation on these estimates and check plot of the loadings.008 0.295 FEV1.951 0.VC 0.523 0.378 -0.005 Loadings: Factor1 Factor2 age -0.153 VC 0. 3:8]. rotation="none"). scores="none".005 0.VC 0.960 -0. • estimate factor scores and plot them for the n observations.270 FEV1 0.834 0. factors=2. > fa <.782 0.Example: A factor analytic analysis of the fvc data might be as follows: • calculate the maximum likelihood estimates of the loadings w/o rotation.274 height 0.factanal(aimu[.109 weight 0.682 -0. fa Uniquenesses: age height weight VC FEV1 FEV1.378 -0.

VC -0.2937 [2.640 > L <.2937 0.97716 $rotmat [.9559 0.587 1.varimax(fa$loadings).] 0.Factor1 Factor2 SS loadings 2.431 0. Lv $loadings Factor1 Factor2 age -0.9972 0.56122 FEV1.4057 -0.2004 0.03488 VC 0.8225 0.9559 59 .02385 FEV1 0.] -0.256 Proportion Var 0.6841 0.1] [.09667 weight 0.fa$loadings > Lv <.431 0.2] [1.2810 -0.209 Cumulative Var 0.37262 height 0.

region).identify(s.5 1.1. scores="reg".VC l2 l2 FEV1 0. factors=2.0 l1 > s <.5 l1 1.0 0.5 0.VC FEV1 0.0 > plot(L).0 height weight VC weight height VC age −0. i <. 3:8]. rot="varimax")$scores > plot(s).5 0. plot(Lv) FEV1. aimu[i.5 0. ] 60 .factanal(aimu[.5 age −0.0 −0.0 1.0 0.5 FEV1.0 0.5 −0.

3 1 2 M −2 −1 0 M A M M −3 f2 25 38 46 57 71 nr year age height weight VC FEV1 FEV1.VC region 25 85 28 189 85 740 500 68 A 38 83 44 174 78 475 335 71 M 46 83 23 190 75 665 635 95 M 57 83 25 187 102 780 580 81 M 71 83 37 173 78 590 400 68 M −3 −2 −1 0 1 2 3 f1 61 .

that is used to optimally assign a new object to the labelled classes. 62 . or groups). we derive a rule. Thus. allocation) To sort objects into 2 or more labelled classes.Discrimination and Classification Discriminant analysis (DA) and classification are multivariate techniques concerned with separating distinct sets of objects (observations) and with allocating new objects to previously defined groups (defined by a categorial variable). There are several purposes for DA: • (Discrimination. • (Classification. the differential features of objects from several known collections (populations. separation) To describe either graphically (low dimension) or algebraically.

.160 160 170 170 180 180 x2 190 190 200 200 210 210 Consider 2 classes. g2. Thus we assume that all objects x in class i have density fi(x). . Xp). i = 1. Label these groups g1. 2. X2. 110 120 130 140 150 110 120 130 140 150 x1 63 . The observed values differ to some extend from one class to the other. The objects are to be classified on the basis of measurements on a p variate random vector X = (X1. . .

i = 1. We define µ1 = E(X|g1) . If we let µ1Y be the mean of Y obtained from X belonging to g1. and µ2 = E(X|g2) and suppose the covariance matrix ¡ t ¢ Σ = E (X − µi)(X − µi) .Technique was introduced by R. 64 . 2 is the same for both populations (somewhat critical in practice). and µ2Y be the mean of Y obtained from X belonging to g2.A. Fisher. He considered linear combinations of x. then he selected the linear combination that maximized the squared distance between µ1Y and µ2Y relative to the variability of the Y ’s. His idea was to transform the multivariate x to univariate y such that the y’s derived from population g1 and g2 were separated as much as possible.

We consider the linear combination Y = `tX and get population-specific means µ1Y = E(Y |g1) = E(`tX|g1) = `tµ1 µ2Y = E(Y |g2) = E(`tX|g2) = `tµ2 but equal variance σY2 = var(Y ) = var(`tX) = `t cov(X)` = `tΣ` . The best linear combination is derived from the ratio (δ = µ1 − µ2) (µ1Y − µ2Y )2 (`tµ1 − `tµ2)2 `t(µ1 − µ2)(µ1 − µ2)t` (`tδ)2 = = t = t t σY2 ` Σ` ` Σ` ` Σ` 65 .

then (`tδ)2 `tΣ` is maximized by the choice −1 ` = cΣ −1 δ = cΣ ¡ ¢ µ1 − µ2 for any c 6= 0. 66 . Choosing c = 1 produces the linear combination t ¢t Y = ` X = (µ1 − µ2 Σ−1X which is known as Fisher’s linear discriminant function.Result: Let δ = µ1 − µ2 and Y = `tX.

Thus the classification rule is: Allocate x0 to g1 if: Allocate x0 to g2 if: ¢t y0 = (µ1 − µ2 Σ−1x0 ≥ m ¢t −1 y0 = (µ1 − µ2 Σ x0 < m .¢t We can also employ this result as classification device. If X 0 is from g2. It can be shown that E(Y0|g1) − m ≥ 0 and E(Y0|g2) − m < 0 That is. if X 0 is from g1. Y0 is expected to be smaller. Y0 is expected to be larger than the midpoint. Let y0 = (µ1 −µ2 Σ−1x0 be the value of the discriminant function for a new observation x0 and let ¢t −1 ¢ 1 1 m = (µ1Y + µ2Y ) = (µ1 − µ2 Σ (µ1 + µ2 2 2 be the midpoint of the 2 univariate population means. 67 .

Hence we use the pooled sample covariance matrix Sp = (n1 − 1)S 1 + (n2 − 1)S 2 n1 + n2 − 2 an unbiased estimate of Σ. and S p in the previous formulas to give Fisher’s sample linear discriminant function ¢t −1 t ˆ y = ` x = (x1 − x2 S p x . S 2. x2. and Σ by their empirical versions. and sample covariance matrices S 1. we combine (pool) S 1 and S 2 to derive a single estimate of Σ. Since it is assumed that the covariance matrices in the groups are the same. Now. and Σ are replaced by x1. we replace µ1. Suppose we have 2 data matrices X 1 from g1 and X 2 from g2 with n1 and n2 observations. 68 .Because the population moments are not known. µ2. x2. µ1. µ2. from which we calculate both sample means x1.

The midpoint between both sample means is ¢t −1 ¢ 1 m ˆ = (x1 − x2 S p (x1 + x2 2 and the classification rule becomes Allocate x0 to g1 if: Allocate x0 to g2 if: ¢t (x1 − x2 S −1 ˆ p x0 ≥ m ¢t −1 ˆ. (x1 − x2 S p x0 < m This idea can be easily generalized onto more than 2 classes. instead of using a linear discriminant function we can also use a quadratic one. 69 . Moreover.

.W.3333333 0.L. > library(MASS) > data(iris3) > Iris <.W.326 s 5.936 2.2].W.L..1].552 2.+Sepal. Iris. Petal.lda(Sp ~ Sepal.1)/3) Prior probabilities of groups: c s v 0. c 5.data.3333333 0. There are 50 observation for each species. Versicolor.L.frame(rbind(iris3[.W.+Petal.770 4.3]).026 70 .Example: Fisher’s Iris data Data describing the sepal (Kelchblatt) width and length.974 5. Sp = rep(c("s".L. Petal. iris3[.246 v 6. Virginica) were observed.588 2.428 1. prior = c(1.3333333 Group means: Sepal.260 1. iris3[.1.+Petal.006 3."v")."c". Sepal..3))) > z <.462 0. and the petal (Bltenblatt) width and length of 3 different Iris species (Setosa.. rep(50.

W. -1.8104603 2.Coefficients of linear discriminants: LD1 LD2 Sepal.16452123 Petal.8293776 0.L.W.83918785 Proportion of trace: LD1 LD2 0.0088 > predict(z.9912 0.5344731 2. [1] s s s [32] s s s [63] c c c [94] c c c [125] v v v Iris)$class s s s s s s s s s s s s c c c c c v c c c c v v v v v v v v s s c v c s s c v v s s c v v s s c v v s s c v v s s c v v s s c v v s s c v v s s c v v s s c v v s c c v v s c c v v s c v v v s c c v v s c c v v s c c v v s c c v v s c c v s c c v s c c v s c c v s c c v 71 .02410215 Sepal.93192121 Petal. -0.L. 2.2012117 -0. 2.

1)/3. prior = c(1.L. Iris.> table(predict(z..lda(Sp ~ Sepal.L.1.sample(1:150. table(Iris$Sp[train]) c s v 24 25 26 > z1 <.+Sepal.+Petal. Iris)$class. 75).W. ])$class [1] s s s s s s s s s s s s s s s s s s s s s s s s s c c c c c c [32] c c c c c c v c c c c c c c c c c c c c v v v v v v v v v v v [63] v v v c v v v v v v v v v > table(predict(z1.W. Iris[-train. Iris$Sp) c s v c 48 0 1 s 0 50 0 v 2 0 49 > train <. ]$Sp) c s v c 25 0 1 s 0 25 0 v 1 0 23 72 . Iris[-train. subset = train) > predict(z1. ])$class.+Petal. Iris[-train.

plot(z1) v 0 v s s s sss s ssssss s ssss sssss s v vvv vv c v v vvvv c cv v v c v v v vv v v v c cccc cccc cc c v c c v c c c c −5 s vvv v vvvvvv v cc c v vvvvv v vv v v c c ccc cccccc c vvvvvvvvvvv v c v c c c c ccccccccc v cvv vv v c c cc cc cc v cc c vv c LD2 0 s sss s ss s ssssssss ss sssss s s s sssss sssssss −5 LD2 5 5 > plot(z). −10 −5 0 LD1 5 −10 −5 0 5 LD1 73 .

3. pch=3.ld. Iris)$x # => LD1 and LD2 coordinates > eqscplot(ir. mkh=0. Iris$Sp. Iris$Sp. LD2 > # calc group-spec.m. Iris$Sp)$means.782550 0. mean) c s v -0. 1].2151330 0. 2].ld[ .ld[ .607600 5.607600 0. xlab="First LD".825049 -0. col=2) # plot group means as "+" 74 .7278996 0. scaled axes > text(ir. ylab="Second LD") # eq.character(Iris$Sp)) # plot LD1 vs.lda(ir.m <.ld <.ld. mean) c s v 1.5127666 > # faster alternative: > ir. ir.5127666 > points(ir. means of LD1 & LD2 > tapply(ir.782550 > tapply(ir. type="n".ld. as.2151330 v 5.7278996 s -7.825049 -7.m LD1 LD2 c 1.> ir.predict(z.

.m[2.. col=1) perp(ir.]. col=2) perp(ir.m[1.-(x[1]-y[1])/(x[2]-y[2]) abline(c(m[2]-s*m[1]. ir.].].> + + + + > > > > perp <. s).) invisible() } perp(ir.m[2.].m[1.function(x. col=3) # midpoint of the 2 group means # perpendicular line through midpoint # draw classification regions # classification decision b/w groups 1&2 # classification decision b/w groups 1&3 # classification decision b/w groups 2&3 75 ..]. y.].m[3. ir. ir.m[3.(x+y)/2 s <. .. .) { m <.

5 s s 0 ss v v vv v v v v v v v vv v vv c c c vv v vv v vv c c cccc cccccc c vv v vvvv v v v v cccccc c cc v vv v c cv c ccc cc v v v c c c ccc v c c cc cc c c vv c ss s sssssss s s s s ssss s s s sssssss s ss ssssssssss s v −5 Second LD s −10 −5 0 First LD 5 76 .